
The Woman Who Gave the Macintosh a Smile


Every fifteen minutes or so, as I wrote this story, I moved my cursor northward to click on the disk in the Microsoft Word toolbar that indicates “Save.” This is a superstitious move, as my computer automatically saves my work every ten minutes. But I learned to use a computer in the era before AutoSave, in the dark ages when remembering to save to a disk often stood between you and term-paper disaster. The persistence of that disk icon into the age of flash drives and cloud storage is a sign of its power. A disk means “Save.” Susan Kare designed a version of that disk, as part of the suite of icons that made the Macintosh revolutionary—a computer that you could communicate with in pictures.

Paola Antonelli, the senior curator of architecture and design at the Museum of Modern Art, was the first to physically show Kare’s original icon sketches, in the 2015 exhibit “This is for Everyone.” “If the Mac turned out to be such a revolutionary object––a pet instead of a home appliance, a spark for the imagination instead of a mere work tool––it is thanks to Susan’s fonts and icons, which gave it voice, personality, style, and even a sense of humor. Cherry bomb, anyone?” she joked, referring to the icon which greeted crashes in the original operating system. After working for Apple, Kare designed icons for Microsoft, Facebook, and, now, Pinterest, where she is a creative director. The mainstream presence of Pinterest, Instagram, Snapchat, emoji, and GIFs is a sign that the visual revolutionaries have won: online, we all communicate visually, piecing together sentences from tiny-icon languages.

Kare, who is sixty-four, will be honored for her work on April 20th, by her fellow designers, with the prestigious AIGA medal. In 1982, she was a sculptor and sometime curator when her high-school friend Andy Hertzfeld asked her to create graphics for a new computer that he was working on in California. Kare brought a Grid notebook to her job interview at Apple Computer. On its pages, she had sketched, in pink marker, a series of icons to represent the commands that Hertzfeld’s software would execute. Each square represented a pixel. A pointing finger meant “Paste.” A paintbrush symbolized “MacPaint.” Scissors said “Cut.” Kare told me about this origin moment: “As soon as I started work, Andy Hertzfeld wrote an icon editor and font editor so I could design images and letterforms using the Mac, not paper,” she said. “But I loved the puzzle-like nature of working in sixteen-by-sixteen and thirty-two-by-thirty-two-pixel icon grids, and the marriage of craft and metaphor.”

What Kare lacked in computer experience she made up for in visual knowledge. “Bitmap graphics are like mosaics and needlepoint and other pseudo-digital art forms, all of which I had practiced before going to Apple,” she told an interviewer, in 2000. The command icon, still right there to the left of your space bar, was based on a Swedish campground sign meaning “interesting feature,” pulled from a book of historical symbols. Kare looked to cross-stitch, to mosaics, to hobo signs for inspiration when she got stuck. “Some icons, like the piece of paper, are no problem; but others defy the visual, like ‘Undo.’ ” At one point, there was to be an icon of a copy machine for making a copy of a file, and users would drag and drop a file onto it to copy it, but it was difficult to render a copier at that scale. Kare also tried a cat in a mirror, for copycat. Neither made the cut. She also designed a number of the original Mac fonts, including Geneva, Chicago, and the picture-heavy Cairo, using only a nine-by-seven grid.

Her notebooks are part of the permanent collections of the New York and San Francisco modern-art museums, and one was included in the recent London Design Museum exhibit “California: Designing Freedom.” Justin McGuirk, the co-curator of that exhibition, said, “The Xerox Star initiated the metaphor of the ‘desktop’ as an icon-based method of interacting with computers, but it was the Apple Mac that popularized it.” While the Macintosh once made you wait with a tiny watch designed by Kare, Pinterest offers you a spinning button when you refresh, also designed by Kare. Last fall, the small home-design brand Areaware débuted Kare-designed placemats, coasters, and napkins with bitmap raindrops, waves, and diagonals; I bought them for the whole family for Christmas.

“It’s fun to read that, before there was social media, countless people spent hours with Microsoft Windows Solitaire using the cards I designed,” she said. In 2008, Kare created virtual “gifts” for Facebook that you could buy and send to a friend, with new offerings daily, based on a sixty-four-by-sixty-four-pixel grid. The best-sellers played to the crowd: hearts, penguins, and kisses, like a digital box of chocolates. A sixty-four-pixel palette would seem like a big step up, but Kare doesn’t think detail necessarily makes better icons. “Simple images can be more inclusive,” she said. Look at traffic signs: “There’s a reason the silhouettes of kids in a school crossing sign don’t have plaid lunchboxes and superhero backpacks, even though it’s not because of technology limitations,” she said. “Those would be extraneous details.”

Kare’s personal style is distinctly unfussy. She was bemused last year when her son and colleagues at Pinterest alerted her that a 1984 portrait of her by Norman Seeff, taken for Rolling Stone, had turned up on Reddit in the subreddit of “old school cool.” In the photo, Kare lounges horizontally in her ergonomic chair, wearing jeans and a gray sweatshirt, with one gray-and-burgundy New Balance shoe propped on her desk. “Just a regular 1984 work outfit—nothing special—but seems ‘pre-normcore’ in retrospect,” she said. “I lived in New Balance and Reebok ankle-high workout shoes. Colleagues brought me toy robot souvenirs from work trips to Japan, and I see postcards of favorite images from the Metropolitan Museum.” The toys, the art, and the sneakers embody the rigor and the humor that Kare has always brought to the task of making icons, which resonate across the decades.

A redditor helpfully identified the robots—Monster from Macross (1983) and MR-11 Bulldozer Robo/Dozer (1982)—and, on Twitter, Daniel Mallory Ortberg made a proposal.

In a 2000 interview with Alex Soojung-Kim Pang, now a researcher at Institute for the Future, Kare brings the history of American graphic design full circle. It was she who brought the legendary Paul Rand (the AIGA Medal winner in 1966 and the designer of the I.B.M. logo) to the attention of Steve Jobs when the latter founded NeXT, in 1985, and needed a logo as iconic as Apple’s. Rand’s solution, presented in a hundred-page booklet, for which he was paid a hundred thousand dollars, was a black box poised on one corner, mimicking the distinctive and problematic appearance of the computers themselves. Letters are easier to fit into a perfect cube than motherboards. “Don’t get scared, this is not the design,” Rand quips in a video of the presentation, taking out the book. Kare stands next to Rand in a star-spangled sweater. “Steve and I both learned a lot from him,” she said. “He was very unequivocal. I remember him almost pounding the table, saying, ‘I’ve been doing this for fifty-five years, and I know what you should do!’ It must be great to have that much confidence in such an inexact science.”

I asked Kare if there were other AIGA medalists, besides Rand, whom she saw as influences, and she listed a series of pre-digital greats whose work is known for broad appeal, infectious warmth, and a sometimes cartoony hand: Charles and Ray Eames (1977), Milton Glaser (1972), and the New Yorker contributor and cartoonist Saul Steinberg (1963). Through their work, and now hers, one can see a legacy of personal touch that one hopes will continue into our digital future on a deeper level than fingerprint readers. She gave the Mac a smile—where’s the smile now?


Mathematics I Use (2012)


Recently, on an online forum, a question was posed: How much, and what kind, of mathematics does a working programmer actually use? Here is my answer.

First, I and almost all programmers use a lot of boolean logic, from evaluating boolean expressions for conditionals and loop exit criteria, to rearranging the terms of such expressions according to, e.g., De Morgan's laws. Much of our work borders on the first-order predicate calculus and other predicate logics in the guise of analysis of preconditions, invariants, etc (though it may not always be presented as such).
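
As a tiny Python illustration (the function names and flags are invented), here is a loop-exit predicate rewritten with De Morgan's laws, plus an exhaustive check that the two forms agree:

# De Morgan: not (a or b) == (not a) and (not b)
def keep_reading(eof, error):
    # Original form: continue while we have neither hit EOF nor seen an error.
    return not (eof or error)

def keep_reading_demorgan(eof, error):
    # Equivalent form after applying De Morgan's laws.
    return (not eof) and (not error)

assert all(
    keep_reading(e, r) == keep_reading_demorgan(e, r)
    for e in (False, True) for r in (False, True)
)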

Next, I do a lot of performance analysis. The kind of data sets we process these days are massive. In 2010, Eric Schmidt remarked at the Techonomy conference that we (humans) produce as much data in two days as ever existed world-wide up to 2003. I want to be able to process large chunks of that and infer things from it, and understanding the space and time complexity of the operations we apply to the data is critical to determining whether the computations are even feasible. Further, unlike in much traditional big-O or theta analysis, the constant factors matter very much at that kind of scale: a factor of 2 will not change the asymptotic time complexity of an algorithm, but if it means the difference between running it over 10,000 or 20,000 processors, now we are talking about real resources. The calculations tend to be much more intricate as a result. Examples: can I take some linear computation and reduce it in strength to a logarithmic computation? Can I reduce memory usage by a factor of three? Etc.
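
A minimal Python sketch of the kind of strength reduction meant here, turning a linear membership scan into a logarithmic binary search over sorted data (the data set is made up):

import bisect

haystack = sorted(range(0, 1_000_000, 7))   # illustrative sorted data set

def contains_linear(xs, target):
    # O(n): scans every element.
    return any(x == target for x in xs)

def contains_binary(xs, target):
    # O(log n): exploits the sort order via binary search.
    i = bisect.bisect_left(xs, target)
    return i < len(xs) and xs[i] == target

assert contains_linear(haystack, 699993) == contains_binary(haystack, 699993) == True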

Oftentimes, I want to compute the worst case or an upper bound of, say, the size of some data set. The calculations can be nontrivial for many of these. Or, I may want to analyze some recurrence relation to see how it varies as I increase the recursion depth. To do that, I need, among other things, the Master Theorem and a good understanding of how to analyze series. Believe it or not, this sometimes means I need to evaluate an integral (though mostly of the Riemann variety). Or can I just solve the recurrence and get a closed-form solution? Do I have to resort to linear algebra? This gets into things like generating functions, Stirling numbers, etc. If you are curious what goes into the “fundamental” mathematical concepts necessary to understand computer science, have a look at volume 1 of The Art of Computer Programming, by Donald Knuth, or Concrete Mathematics, by Graham, Knuth, and Patashnik.
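
For example, the mergesort-style recurrence T(n) = 2T(n/2) + n falls under case 2 of the Master Theorem, giving Θ(n log n); a small Python check (illustrative only) compares the recurrence against that estimate:

from functools import lru_cache
import math

@lru_cache(maxsize=None)
def T(n):
    # Divide-and-conquer cost recurrence, e.g. mergesort: T(n) = 2 T(n/2) + n.
    if n <= 1:
        return 1
    return 2 * T(n // 2) + n

# Master Theorem with a=2, b=2, f(n)=n predicts Theta(n log n).
for n in (2**10, 2**15, 2**20):
    print(n, T(n), round(n * math.log2(n)))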

I do a lot of straight-up computation in terms of aggregating, combining and transforming data; lots of combinatorics (e.g., counting things, looking for symmetries across different dimensions, etc). Examples are obvious, I think.
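
One such obvious example, sketched in Python (the scenario is invented): counting selections both by brute-force enumeration and by the binomial coefficient, and confirming they agree:

from itertools import combinations
from math import comb  # Python 3.8+

# How many ways to pick 3 replicas out of 12 servers?
servers = range(12)
assert sum(1 for _ in combinations(servers, 3)) == comb(12, 3) == 220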

I do a lot of discrete mathematics: looking for algebraic structures across operations on extremely large sets. Is there some kind of structure latent in whatever I am doing that can be preserved across homomorphism to some group or ring that I understand better? Is there a looser constraint? Can I apply group actions to some set in order to build a mental model for some transformation that makes it easier to reason about? Can I define some topology towards analyzing the data? You would be surprised how many things fall into the category of discrete topologies. For that matter, you would be surprised how many places the triangle inequality shows up.
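
A down-to-earth Python example of a homomorphism being exploited (the shard contents are invented): taking a sum modulo m commutes with summing, so per-shard checksums can be combined without ever materializing the full data set:

M = 2**61 - 1
shards = [list(range(i * 100_000, (i + 1) * 100_000)) for i in range(3)]

def shard_checksum(xs, m=M):
    # Reducing mod m before or after summing gives the same answer,
    # because x -> x mod m is a ring homomorphism for addition.
    return sum(x % m for x in xs) % m

combined = sum(shard_checksum(s) for s in shards) % M
direct = sum(x for s in shards for x in s) % M
assert combined == direct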

I do a lot of graph theory. “Designing a web site” is not just about where to put the cute cat graphic on the page; it is also about inserting nodes into the global hyperlink graph: a single page potentially adds many edges to the graph, and that can have subtle effects on performance, analysis, search engine rankings, etc. Understanding the consequences of that can yield interesting insights: how does the graph grow? As it turns out, it looks an awful lot like it obeys a power law: the web is a scale-free network. What is the shortest distance between two nodes in that graph? What would it mean for the web graph to be planar? Bipartite? When, if ever, do these properties hold? What if the graph isn't the web, but the entire road network for North America, Europe or Asia?
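
A toy Python sketch of the shortest-distance question on a hyperlink graph, using breadth-first search (the pages and links are invented):

from collections import deque

links = {
    "home":   ["about", "blog"],
    "about":  ["home"],
    "blog":   ["home", "post-1", "post-2"],
    "post-1": ["blog", "post-2"],
    "post-2": ["blog"],
}

def click_distance(graph, src, dst):
    # BFS gives the shortest path in number of clicks (unweighted edges).
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == dst:
            return dist
        for nxt in graph.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # unreachable

assert click_distance(links, "home", "post-2") == 2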

This implies something else. Something people often do not realize about “modern” web pages is that they are not just HTML documents with links to images and other resources; they are really tree structures of data that are linked together in a graph. Those trees are often walked over, reprocessed and updated dynamically by interactions between the user's web browser and some server (this is what “AJAX” is all about). For a clever and relevant example of this, see MathJax. Or Gmail. Understanding how to do that means having some understanding of symbolic computation and semantic analysis of the elements of a page. For MathJax, the authors must write a program that can walk a tree generated from the Document Object Model (or DOM), look for mathematical elements, parse them, and dynamically replace them with newly rendered elements representing the underlying expression. It may not seem like much to most users, for whom it “just works”, but it is actually some heady stuff under the hood. I do not do things like that necessarily (I am not a front-end person), but I do do similar things in Lisp all the time. Note that Lisp was originally defined as a mathematical notation for symbolic processing: Lisp macros are all about manipulation of symbolic expressions.
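
In the same spirit, here is a toy Python sketch of the MathJax idea: walk a DOM-like tree, find elements marked as math, and swap in a "rendered" replacement (the markup and the rendering are stand-ins, not MathJax's actual mechanism):

import xml.etree.ElementTree as ET

html = ET.fromstring(
    '<div><p>Euler says <span class="math">e^(i*pi) + 1</span> is zero.</p></div>'
)

def render_math(tree):
    # Walk every <span>, pick out the math-flagged ones, replace their text.
    for span in tree.iter("span"):
        if span.get("class") == "math":
            span.text = "[rendered: {}]".format(span.text)
    return tree

print(ET.tostring(render_math(html), encoding="unicode"))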

I do a lot of time-series analysis. How is traffic or resource consumption changing? What are the trends? Is a spike in request latency or memory use seasonal? How does the rate of change of something vary as input changes in different dimensions? Is it correlated with some external event?
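
A small Python sketch of one such check (the latency series is synthetic): a strong correlation at a lag of 24 hours suggests a daily, i.e. seasonal, pattern:

import random

random.seed(0)
hours = 24 * 14  # two weeks of hourly samples with a bump every day at 18:00
latency = [10 + 5 * (h % 24 == 18) + random.random() for h in range(hours)]

def lagged_correlation(xs, lag):
    # Pearson correlation between the series and a lagged copy of itself.
    a, b = xs[lag:], xs[:-lag]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    var_a = sum((x - ma) ** 2 for x in a) / n
    var_b = sum((y - mb) ** 2 for y in b) / n
    return cov / (var_a * var_b) ** 0.5

print(lagged_correlation(latency, 24))  # close to 1: a daily pattern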

I do a lot of statistical analysis of data, not just to understand its performance characteristics but also to understand the data itself. In addition to looking at the aforementioned DOM tree for semantic metadata (e.g., microdata and microformats, RDFa, other XML data with some explicit schema, etc), I am also trying to make sense of unstructured data. What is the probability that this text is a street address? Geographical coordinates? What context does it appear in? Is it spam? Does it make sense? Does it look like the output of a Markov chain generator? Is it a series of exact quotes from some well-known work of literature? Or is it some discussion about literature? Is it a discussion about spam that includes literature? I still chuckle when I think about the piece of spam I got for pharmaceuticals wrapped inside a section of Mikhail Bulgakov's The Master and Margarita.
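
For the Markov-chain question, here is the generator side of it in a few lines of Python (the corpus is a stand-in); text that "looks like the output of a Markov chain generator" is produced exactly like this:

import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat ate the rat".split()

# First-order chain: record which words follow which.
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def babble(start, length, rng=random.Random(42)):
    word, out = start, [start]
    for _ in range(length - 1):
        choices = transitions.get(word)
        if not choices:
            break
        word = rng.choice(choices)
        out.append(word)
    return " ".join(out)

print(babble("the", 10))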

Category theory. Types in computer programming languages roughly correspond to categories, and monads and functors can be brought to bear to greatly simplify some constructs in surprisingly elegant ways. For instance, the functional programming language Haskell uses monads for input and output, and for modeling state. Simplified programs are easier to get right, easier to reason about, understand, modify, etc. Types can often be inferred; this brings in things like unification (which can also be used in general problems of inference). Consider using inference to apply Prolog-style predicates as an approach to transforming graphs in a distributed system.
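
Python has no monads, but a minimal Maybe/Option-style bind conveys the flavor (this is a sketch, not Haskell): each step runs only if the previous one produced a value:

def bind(value, fn):
    # Propagate "no value" instead of raising; otherwise apply the next step.
    return None if value is None else fn(value)

def parse_int(s):
    try:
        return int(s)
    except ValueError:
        return None

def reciprocal(n):
    return None if n == 0 else 1 / n

print(bind(bind("25", parse_int), reciprocal))    # 0.04
print(bind(bind("oops", parse_int), reciprocal))  # None, failure threads through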

Distributed systems bring us back to graph theory: at scale, in the real world, systems fail, backhoes cut fiber, and there are earthquakes, volcanoes, and fishing trawlers that disturb marine cables. One needs to understand the graph characteristics of the network to understand the effects of these things and how best to respond. Routing algorithms and network analysis are intimately tied to things like how to find the shortest path between nodes in the network graph; Dijkstra's algorithm, anyone?
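
Since Dijkstra's algorithm is named, here it is in compact Python over a toy network whose edge weights stand in for link latencies (all values invented):

import heapq

network = {
    "nyc": {"lon": 35, "chi": 9},
    "chi": {"nyc": 9, "sfo": 21},
    "sfo": {"chi": 21, "lon": 75},
    "lon": {"nyc": 35, "sfo": 75, "ams": 4},
    "ams": {"lon": 4},
}

def dijkstra(graph, src):
    # Classic Dijkstra with a binary heap: O((V + E) log V).
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

print(dijkstra(network, "nyc"))  # e.g. nyc->lon direct (35) beats nyc->chi->sfo->lon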

Also, how does one distribute a large computation across globally distributed data centers? You have to understand some physics to do this well: at Internet scale, the speed of light starts to be a bottleneck. Heat dissipation, density of electrical current draw per unit area, etc, are all real-world considerations that go into what programmers do. Should I put a data center in Iceland? Cheap cooling and geothermal energy sound appealing, but what is the minimum latency to some place where users care about the data living on those servers in Iceland? What is the great-circle distance between, say, Iceland and London? Berlin? Amsterdam? These are fairly simple things to figure out, but I need to have enough mathematical chops to do them. Can I run fiber from Iceland to some hub location? What is the average latency? What is the probability of a fiber break in a marine cable under the North Sea over a 12 month period? 48 months?
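
A short Python sketch of exactly that calculation (coordinates are approximate, and the two-thirds-of-c figure for light in fibre is a rule of thumb): the haversine great-circle distance from Reykjavík to London, and the one-way latency floor it implies:

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2, r_km=6371.0):
    # Great-circle distance on a sphere of radius r_km.
    p1, p2 = radians(lat1), radians(lat2)
    dp, dl = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dp / 2) ** 2 + cos(p1) * cos(p2) * sin(dl / 2) ** 2
    return 2 * r_km * asin(sqrt(a))

reykjavik, london = (64.13, -21.90), (51.51, -0.13)
d = haversine_km(*reykjavik, *london)
latency_ms = d / (2 / 3 * 299_792.458) * 1000  # km / (km per s), as milliseconds
print(f"{d:.0f} km, at least {latency_ms:.1f} ms one-way in fibre")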

Of course, the theory of computation and automata, parsing, grammars, regular languages, etc, all enter into programmers' work. I do a lot of parsing and pattern matching. At even moderate size, real-world data sets contain items that can trigger pathologically bad behavior when using, for instance, backtracking techniques. If I use regular expressions to match data, I must be careful to make sure that the expressions really are regular. If I am using a push-down automaton to parse a context-free grammar (which happens every time you send a request to an HTTP server, by the way), I have to make sure I limit the depth of recursion to avoid things like procedure-call stack exhaustion, which requires understanding the underlying principles of the computation and the mathematics they are based on. If I have to actually write a recursive descent parser for some funky grammar that is not LALR(1) (so I can't just use yacc or bison), I have to be careful to maintain the state stack separately from procedural recursion. This is also something I need to understand if I am walking a DOM tree (or any recursively defined data structure). Some programming languages recognize this as a hassle for the programmer and work around it by using segmented stacks. Of course, it would be nice if I could define my “compilation” of some parsable resource as a function (in the mathematical sense). Wouldn't it be nice if this was all just some linear programming optimization problem?
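
As an illustration of the recursive-descent case, here is a minimal Python parser for a toy arithmetic grammar with the usual precedence, expr := term ('+' term)*, term := factor ('*' factor)*; the grammar and tokenizer are invented for the example, and the tokenizing regex is genuinely regular:

import re

TOKEN = re.compile(r"\d+|[+*()]")

def parse(text):
    tokens = TOKEN.findall(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        t = tokens[pos]
        assert expected is None or t == expected, f"expected {expected}, got {t}"
        pos += 1
        return t

    def factor():
        if peek() == "(":
            eat("(")
            v = expr()
            eat(")")
            return v
        return int(eat())

    def term():
        v = factor()
        while peek() == "*":
            eat("*")
            v *= factor()
        return v

    def expr():
        v = term()
        while peek() == "+":
            eat("+")
            v += term()
        return v

    return expr()

assert parse("2+3*(4+1)") == 17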

Note that none of this is esoterica; it is all based on real-world experience with real-world data sets and problems. Of course, I do not do all of this every day, but I have done all of it at one time or another and most of it regularly. Probably a lot more is based on observation, experience and heuristics than should be (the heuristic models are often incomplete and inaccurate). Do I know enough math to calculate the average error between reality and my heuristic model?

This is what computer science really is, and how it interacts with programming and the realities of modern computing. Being the “IT expert” somewhere is not the same thing as being a computer scientist, and as many correctly note, being a computer scientist is a lot closer to being an applied mathematician than a tradesman. This is not to downplay the importance of trade professions, which are both useful and highly respectable, but to point out that computer science is different. I am very fortunate that I was able to serve in the Marines with a number of folks who work in trades; we as a society should show them a lot more respect than we do.

(For the record, I am not a computer scientist. I was trained as a [pure] mathematician, and what I do professionally is a lot closer to engineering.)

Towards Scala 3


Thursday 19 April 2018

Martin Odersky

Now that Scala 2.13 is only a few months away, it’s time to consider the roadmap beyond it. It’s been no secret that the work on Dotty over the last 5 years was intended to explore what a new Scala could look like. We are now at a stage where we can commit: Dotty will become Scala 3.0.

Of course, this statement invites many follow-up questions. Here are some answers we can already give today. We expect there will be more questions and answers as things shape up.

When will it come out?

The intent is to publish the final Scala 3.0 soon after Scala 2.14. At the current release schedule (which might still change), that means early 2020.

What is Scala 2.14 for?

Scala 2.14’s main focus will be on smoothing the migration to Scala 3. It will do this by defining migration tools, shim libraries, and targeted deprecations, among other things.

What’s new in Scala 3?

Scala has pioneered the fusion of object-oriented and functional programming in a typed setting. Scala 3 will be a big step towards realizing the full potential of these ideas. Its main objectives are to

  • become more opinionated by promoting programming idioms we found to work well,
  • simplify where possible,
  • eliminate inconsistencies and surprising behavior,
  • build on strong foundations to ensure the design hangs well together,
  • consolidate language constructs to improve the language’s consistency, safety, ergonomics, and performance.

The main language changes, either implemented or projected, are listed in the Reference section on the Dotty website. Many of the new features will be submitted to the SIP process, subject to approval.

It’s worth emphasizing that Scala 2 and Scala 3 are fundamentally the same language. The compiler is new, but nearly everything Scala programmers already know about Scala 2 applies to Scala 3 as well, and most ordinary Scala 2 code will also work on Scala 3 with only minor changes.

What about migration?

As with previous Scala upgrades, Scala 3 is not binary compatible with Scala 2. They are mostly source compatible, but differences exist. However:

  • Scala 3 code can use Scala 2 artifacts because the Scala 3 compiler understands the classfile format for sources compiled with Scala 2.12 and upwards.
  • Scala 3 and Scala 2 share the same standard library.
  • With some small tweaks it is possible to cross-build code for both Scala 2 and 3. We will provide a guide defining the shared language subset that can be compiled under both versions.
  • The Scala 3 compiler has a -language:Scala2 option that lets it compile most Scala 2 code and at the same time highlights necessary rewritings as migration warnings.
  • The compiler can perform many of the rewritings automatically using a -rewrite option.
  • Migration through automatic rewriting will also be offered through the scalafix tool, which can convert sources to the cross-buildable language subset without requiring Scala 3 to be installed.

What’s the expected state of tool support?

  • Compiler: The Scala 3 compiler dotc has been used to compile itself and a growing set of libraries for a number of years now.
  • IDEs: IDE support is provided by having dotc implement LSP, the Language Server Protocol, including standard operations such as completion and hyperlinking and more advanced ones such as find references or rename. There’s a VS Code plugin incorporating these operations. JetBrains has also released a first version of Scala 3 support in their Scala IntelliJ plugin, and we intend to work with them on further improvements.
  • REPL: A friendly REPL is supported by the compiler.
  • Docs: A revamped Scaladoc tool generates docs for viewing in a browser and (in the future) also in the IDE.
  • Build tools: There is a Dotty/Scala 3 plugin for sbt, and we will also work on Scala 3 integration in other build tools.

What about stability?

  • A community build contains some initial open source projects that are compiled nightly using Scala 3. We plan to add a lot more projects to the build between now and the final release.
  • We plan to use the period of developer previews to ensure that core projects are published for Scala 3.
  • We have incorporated most of the Scala 2 regression tests in the Scala 3 test suite and will keep including new tests.
  • In the near future we plan to build all Scala 3 tools using a previous version of the dotc compiler itself. So far all tools are built first with the current Scala compiler and then again with dotc. Basing the build exclusively on Scala 3 has the advantage that it lets us “eat our own dog food” and try out the usability of Scala 3’s new language features on a larger scale.

When can I try it out?

You can start working with Dotty now. See the getting started guide. Dotty releases are published every 6 weeks. We expect to be in feature-freeze and to release developer previews for Scala 3.0 in the first half of 2019.

What about macros?

Stay tuned! We are about to release another blog post specifically about that issue.

How can I help?

Scala 3 is developed completely in the open at https://github.com/lampepfl/dotty. Get involved there by fixing and opening issues, making pull requests, and participating in the discussions.

Just About Everyone with Bitcoin Is Lying to the IRS


Bitcoin may have peaked in terms of value and public interest in 2017, but the cryptocurrency has been far less popular this tax season. According to a recent report, fewer than 100 people have reported bitcoin holdings so far.

The figure, reported by CNBC, comes from Credit Karma, a popular financial app through which people can view their credit score and file taxes. Of the more than 250,000 people who have filed through Credit Karma this year, a whopping 0.04 percent have claimed to have money in bitcoin or other cryptocurrencies.

This means one of two things: either the incredible rise and completely inevitable fall of the cryptocurrency market was all just a fever dream and no one actually put a single cent into the inherently risky concept because they all knew safe, reliable investment options were available... or a bunch of people are lying to the IRS.

It is almost certainly the latter that is true. “If I had to guess, there’s probably a lot of underreporting,” Elizabeth Crouse, a partner at law firm K&L Gates, told CNBC. “Most of the people in the cryptocurrency world tend to have a pretty high risk tolerance.”

Cryptocurrency holders aren’t exactly new to fudging the numbers on their taxes. According to the IRS, just 802 people in total reported cryptocurrency gains and losses on their tax filings in 2015. While the digital currencies were nowhere near as popular back then, assuredly more than a couple hundred people had holdings.

In part, it’s the complexities of reporting cryptocurrency transactions that have caused investors to opt out of the process entirely. The IRS has offered guidance on bitcoin transactions since 2014 and considers the cryptocurrency to be property, not currency. As such, every purchase, sale, trade, and mining effort is considered to be a taxable event.
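
To see why the bookkeeping adds up, here is a toy Python sketch of a FIFO cost-basis calculation, the sort of arithmetic each taxable event requires (the lots and prices are invented, and this is not tax advice):

from collections import deque

buys = deque([(0.5, 2000.0), (0.5, 6000.0)])   # (amount of BTC, price per BTC at purchase)

def realized_gain(sell_amount, sell_price, lots):
    # Consume the oldest purchase lots first (FIFO) and total the gain.
    gain = 0.0
    while sell_amount > 1e-12:
        amt, basis = lots[0]
        used = min(amt, sell_amount)
        gain += used * (sell_price - basis)
        sell_amount -= used
        if used == amt:
            lots.popleft()
        else:
            lots[0] = (amt - used, basis)
    return gain

print(realized_gain(0.75, 10000.0, buys))  # 0.5*(10000-2000) + 0.25*(10000-6000) = 5000.0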

For those who spent the whole year trading on the cryptocurrency markets, that creates a lot of activity to report—especially for those who didn’t make much or ended up losing it all during the market’s downturn.

Bitcoin traders have been less than pleased with the realization that they have to pay taxes on any of their earnings. Some have learned that they racked up a tax bill considerably larger than their annual income. On Reddit, users have reported owing tens of thousands of dollars to the IRS for their transactions.

Others have opted simply not to pay and dared the IRS to come find them, betting on the anonymity of the blockchain. This is, of course, a terrible idea, as bitcoin is not anonymous and the IRS has the tools to track people down.

Taxes are due April 17th this year. Here’s hoping you don’t have any bitcoin to report.

[Investopedia, CNBC]

The ‘Terms and Conditions’ Reckoning Is Coming


Eleanor Margolis had used PayPal for more than a decade when the online payment provider blocked her account in January. The reason: She was 16 years old when she signed up, and PayPal Holdings Inc. insists she should have known the minimum age is 18, because the rule is clearly stated in terms and conditions she agreed to. Clearly stated, that is, in a document longer than The Great Gatsby—almost 50,000 words spread across 21 separate web pages. “They didn’t have any checks in place to make sure I was over 18,” says Margolis, now 28. “Instead, they contact me 12 years later. It’s completely absurd.”

Personal finance forums online are brimming with complaints from hundreds of PayPal customers who say they’ve been suspended because they signed up before age 18. PayPal declined to comment on any specific cases, but says it’s appropriate to close accounts created by underage people “to ensure our customers have full legal capacity to accept our user agreement.” While that may seem “heavy-handed,” says Sarah Kenshall, a technology attorney with law firm Burges Salmon, the company is within its rights because the users clicked to agree to the rules—however difficult the language might be to understand.

Websites have long required users to plow through pages of dense legalese to use their services, knowing that few ever give the documents more than a cursory glance. In 2005 security-software provider PC Pitstop LLC promised a $1,000 prize to the first user to spot the offer deep in its terms and conditions; it took four months before the reward was claimed. The incomprehensibility of user agreements is poised to change as tech giants such as Uber Technologies Inc. and Facebook Inc. confront pushback for mishandling user information, and the European Union prepares to implement new privacy rules called the General Data Protection Regulation, or GDPR. The measure underscores “the requirement for clear and plain language when explaining consent,” British Information Commissioner Elizabeth Denham wrote on her blog last year.

During two days of testimony before the U.S. Congress this month, Mark Zuckerberg, Facebook’s chief executive officer, was repeatedly chastised for burying important information in text that’s rarely read. Waving a 2-inch-thick printed version of the social network’s user agreement, Senator Lindsey Graham quoted a line from the first page, then intoned: “I’m a lawyer, and I have no idea what that means.” The South Carolina Republican later asked Zuckerberg whether he thinks consumers understand what they’re signing up for. The Facebook CEO’s response: “I don’t think the average person likely reads that whole document.”

GDPR, which comes into force in Europe in May and calls for fines as high as 4 percent of a company’s global revenue for violations, will make it tougher to get away with book-length user agreements, says Eduardo Ustaran, co-director of the cybersecurity practice at law firm Hogan Lovells. He suggests that companies streamline their rules and make sure they’re written in plain English. If a typical user wouldn’t understand the documents, the consent that companies rely on for their business activities would be legally invalid. “Your whole basis for using people’s personal data would disappear,” Ustaran says.

Companies are scrambling to ensure their user agreements comply with the law, says Julian Saunders, founder of Port.im, a British software maker that helps businesses adapt to GDPR. But he says many website owners aren’t yet explicit enough in stating why they’re collecting a consumer’s information, which other companies might gain access to it, and how people can ensure their data are deleted if they request it. Saunders says he’s signed up 100 businesses for the service and urges them to bend over backward in helping users understand the details. “Areas that used to get hidden in the small print of terms and conditions should now be exposed,” he says.

Martin Garner, an analyst at technology consultancy CCS Insight, suggests companies walk readers through their policies step by step. That way they could opt out of selected provisions—limiting, for instance, third parties that can gain access to the data or restricting the kinds of information companies may stockpile. Much of what’s in the terms and conditions might be affected by the settings a user chooses, and including that information in the initial agreement unnecessarily complicates the document. “Users typically only have the choice of accepting the terms and conditions in their entirety or not using the service at all,” Garner says. Companies must “pay much closer attention to explaining to users how their data will be stored and used and getting them to consent to that explicitly.”

BOTTOM LINE - To comply with new EU data regulations, website owners are scaling back and simplifying complex user agreements that can be longer than many novels.

Google changes its messaging strategy again: Goodbye to Allo, double down on RCS


Google's long and winding road to figuring out messaging is taking yet another change of direction after the company called time on Allo, its newest chat app launch, in order to double down on its vision to enable an enhanced version of SMS.

The company told The Verge that it is “pausing” work on Allo, which was only launched as recently as September 2016, in order to put its resources into the adoption of RCS (Rich Communication Services), a messaging standard that has the potential to tie together SMS and other chat apps. RCS isn’t new, and Google has been pushing it for some time, but now the company is rebranding it as “Chat” and putting all its efforts into getting operators on board.

The new strategy will see almost the entire Allo team switch to Android Messages, according to The Verge.

In case you didn’t hear about it before, RCS is essentially a technology that allows basic “SMS” messaging to be standardized across devices. In the same way that iMessage lets Apple device owners chat for free using data instead of paid-for SMS, RCS could allow free chats across different networks on Android or other devices. RCS can be integrated into chat apps, which is something Google has already done with Android Messages, but the tipping point is working with others, and that means operators.

Unlike Apple's iMessage, RCS is designed to work with carriers, which can develop their own messaging apps that work with the protocol and connect to other apps, including chat apps. Essentially, it gives them a chance to take part in the messaging boom rather than be cut out as WhatsApp, Messenger, iMessage and others take over. They don't make money from consumers, but they do get to keep their brand, and they can look to get revenue from business services.

But this approach requires operators themselves to implement the technology. That's no easy thing, as carriers don't exactly trust tech companies — WhatsApp alone has massively eaten into their SMS and call revenues — and they don't like working with each other, either.

Google said more than 55 operators worldwide have been recruited to support Chat, but it isn’t clear exactly when they might roll it out. Microsoft is among the OEM supporters, which raises the possibility it could bring support to Windows 10, but the company was non-committal when The Verge pressed it on that possibility.

Google has tried many things on messaging, but it has largely failed because it doesn’t have a ramp to users. WhatsApp benefited from being a first mover — all the other early leaders in Western markets are nowhere to be seen today — and Facebook Messenger is built on top of the world’s most popular social network.

Both of those services have more than one billion active users; Allo never got to 50 million. Google search doesn't have that kind of contact with users, and the company's previous efforts didn't capture market share. (Hangouts was promising, but it has pivoted into a tool for enterprises.)

That left Google with two options: take on carriers directly with an iMessage-style service that’s built into Android, or work with them.

It chose the second option. It is far messier with so many different parties involved, but it is also apparently a principled approach.

“We can’t do it without these [carrier and OEM] partners. We don’t believe in taking the approach that Apple does. We are fundamentally an open ecosystem. We believe in working with partners. We believe in working with our OEMs to be able to deliver a great experience,” Anil Sabharwal, the Google executive leading Chat, told The Verge.

Sabharwal refused to be drawn on a timeframe for operators rolling out Chat apps.

“By the end of this year, we’ll be in a really great state, and by mid-next year, we’ll be in a place where a large percentage of users [will have] this experience,” he said, explaining that uptake could be quicker in Europe or Latin America than the U.S. “This is not a three-to-five-year play. Our goal is to get this level of quality messaging to our users on Android within the next couple of years.”

We shall see. But at least there won’t be yet more Google messaging apps launching, so there’s that.

No “Material Difference Between 5G and LTE”


Eric Xu, current Huawei chairman, concludes that consumers would find “no material difference between 5G and LTE,” report Louise Lucas and Nic Fildes in the Financial Times. He added, “Since 4G is robust, we don’t see many use cases or applications we need to support with 5G.”

Last May, I reported similar thoughts from Telefónica CTO Enrique Blanco. In November, DT CTO Bruno Jacobfeuerborn, Orange SVP Arnaud Vamparys, and BT CEO Gavin Patterson chimed in. The politicians and marketers screamed "5G Revolution." The engineers knew better.

Andrus Ansip and Roberto Viola at the EU, as well as Jessica Rosenworcel and Ajit Pai in the U.S., are still lying to themselves and too proud to face their errors.

Latency:  Ericsson has promised LTE latency of 9 ms in 2018. AT&T's 5G latency is 9-11 ms.

Speed: 4G LTE in 2018 delivers hundreds of megabits, peaking over a gigabit. 90% of the 5G on the way is midband, the same hundreds of megabits. Only 10-20% of deployments in the first few years will be millimeter wave, often a gigabit.

Applications: 5G's main application around the world will be more capacity, a good thing for telcos. But most of the other claims are b______. 

  • Connected cars are already on the road, using lidar & radar, not the phone network. Xu points out, “even today we have the technology that can support autonomous driving”.
  • The talk of remote surgery is totally ridiculous unless you believe in operating from the beach. The wireless hop is only a small part of the total delay: the path from the local connection to the operating room will be 20-50 ms, so whether the wireless leg is 2, 5, or 10 ms is a minor factor.
  • IoT will rarely require speeds of more than hundreds of megabits, and most of it is actually kilobits. If there are so many connections that LTE is overwhelmed, a local Wi-Fi or LTE picocell can handle almost everything.
  • VR experts tell me they are fine at 10-15 ms.
  • Outside the U.S., very little fixed wireless requires 5G speeds. AT&T and Verizon will use mmWave where they don't have decent broadband, about 75% of the country in each case. Nearly everywhere else, the telco covers the whole country with landlines.

"5G Revolution" is dead. Millimeter wave and Massive MIMO will be crucial to telcos going forward; midband spectrum will be important, running at LTE speeds and really LTE with a minor software tweak and a press release. Until 2018, it was called TD-LTE.

Verizon's D.C. representative, CTIA, is making nonsense claims for 5G based on 4G applications.

BPF, EBPF, XDP and Bpfilter

You may have been following the development of the extended Berkeley Packet Filter  (eBPF)  in the kernel community since 3.15, or you may still associate the Berkeley Packet Filter with the work Van Jacobson did in 1992. You may have used BPF for years with tcpdump, or you may have started to plumb it in your data plane already! This blog aims to describe, at very high level, the key developments from a performance networking point of view and why now this is becoming important to the network operator, sysadmin and enterprise solution provider in the same way that it has been relevant since its inception for the large scale data center operator.

BPF or eBPF—What is the difference?

The Virtual Machine

Fundamentally BPF is still BPF: it is a small virtual machine which runs programs injected from user space and attached to specific hooks in the kernel. It can classify and do actions upon network packets. For years it has been used on Linux to filter packets and avoid expensive copies to user space, for example with tcpdump. However, the scope of the virtual machine has changed beyond recognition over the last few years.

Figure 1. A comparison of the cBPF vs. eBPF machines

Classic BPF (cBPF), the legacy version, consisted of a 32-bit wide accumulator, a 32-bit wide ‘X’ register which could also be used within instructions, and 16 32-bit registers which are used as a scratch memory store. This obviously led to some key restrictions. As the name suggests, the classic Berkeley Packet Filter was mostly limited to (stateless) packet filtering. Anything more complex would be completed within other subsystems.

eBPF significantly widened the set of use cases for BPF, through the use of an expanded set of registers and of instructions, the addition of maps (key/value stores without any restrictions in size), a 512 byte stack, more complex lookups, helper functions callable from inside the programs, and the possibility to chain several programs. Stateful processing is now possible, as are dynamic interactions with user space programs. As a result of this improved flexibility, the level of classification and the range of possible interactions for packets processed with eBPF has been drastically expanded.

But new features must not come at the expense of safety. To ensure proper exercise of the increased responsibility of the VM, the verifier implemented in the kernel has been revised and consolidated. This verifier checks for any loops within the code (which could lead to possible infinite loops, thus hanging the kernel) and any unsafe memory accesses. It rejects any program that does not meet the safety criteria. This step, performed on a live system each time a user tries to inject a program, is followed by the BPF bytecode being JITed into native assembly instructions for the chosen platform.

Figure 2. Compilation flow of an eBPF program on the host. Some supported CPU architectures are not displayed.

To provide key functionality that would be difficult to implement or optimize within the restrictions of eBPF, there are many helpers designed to assist with tasks such as map lookups or the generation of random numbers. My colleague, Quentin Monnet, is currently going through the process of getting all kernel helpers documented and has a patch set out for review.
The hooks: where do the packets get classified?

Hooks for eBPF are proliferating due to its flexibility and usefulness. However, we will focus on those at the lower end of the datapath. The key difference here is that eBPF adds an additional hook in driver space, called eXpress DataPath, or XDP. This allows users to drop, reflect or redirect packets before an skb (socket buffer) metadata structure has been added to them, which leads to a performance improvement of about 4-5X.

Figure 3. High-performance networking relevant  eBPF hooks with comparative performance for simple use case
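
As a rough illustration of how little code is needed to attach a program at the XDP hook, here is a sketch using the bcc Python bindings rather than the kernel toolchain described in this post; the interface name eth0, the counter map and the program itself are illustrative, and it assumes a recent kernel, bcc installed and root privileges:

from bcc import BPF
import time

# A minimal XDP program, compiled and attached via bcc: it counts packets in a
# BPF array map and passes them on. Returning XDP_DROP instead would discard
# packets before any skb is allocated.
prog = r"""
#include <uapi/linux/bpf.h>

BPF_ARRAY(pkt_count, u64, 1);

int xdp_count(struct xdp_md *ctx) {
    int key = 0;
    u64 *value = pkt_count.lookup(&key);
    if (value)
        __sync_fetch_and_add(value, 1);
    return XDP_PASS;
}
"""

device = "eth0"  # placeholder interface name
b = BPF(text=prog)
fn = b.load_func("xdp_count", BPF.XDP)
b.attach_xdp(device, fn, 0)
try:
    while True:
        time.sleep(1)
        for _, v in b["pkt_count"].items():
            print("packets seen:", v.value)
except KeyboardInterrupt:
    pass
finally:
    b.remove_xdp(device, 0)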

Offloading eBPF to the NFP

Back in 4.9, my colleague and our kernel driver maintainer, Jakub Kicinski, added the Network Flow Processor (NFP) BPF JIT-compiler to the kernel, initially for cls_bpf (https://www.spinics.net/lists/netdev/msg379464.html). Since then Netronome has been working on improving the BPF infrastructure in the kernel and also in LLVM, which generates the bytecode (thanks to the work of Jiong Wang). Through the NFP JIT, we have managed to effectively modify the program flow as shown in the diagram below:

Figure 4. Compilation flow with NFP JIT included (some supported CPU architectures are not displayed)

The key reason this was possible is how well the BPF machine maps to our flow processing cores on the NFP. This means that the NFP-based Agilio CX SmartNIC, running at between 15-25W, can offload a significant amount of processing from the host. In the load balancing example below, the NFP processes the same amount of packets as nearly 12 x86 cores from the host combined, an amount physically impossible for the host to handle due to PCIe bandwidth restrictions (cores used: Intel Xeon CPU E5-2630 v4 @ 2.20GHz).

Figure 5. Comparative performance of a sample load balancer on the NFP and x86 CPU (E5-2630 v4)

Performance is one of the main reasons why using BPF hardware offload is the correct technique to program your SmartNIC. But it is not the only one: let’s review some other incentives.

  1. Flexibility: One of the key advantages BPF provides on the host is the ability to reload programs on-the-fly. This enables the dynamic replacement of programs in an operating data center. Code which would otherwise likely be out-of-tree kernel code, or inserted in some other less flexible subsystem, can now be easily loaded or unloaded. This provides significant advantages to the data center due to bugs not requiring system restarts: instead, simply reloading an adjusted program will do.

    This model can now be extended to offload as well. Users are able to dynamically load, unload, reload programs on the NFP while traffic is running. This dynamic rewriting of firmware at runtime provides a powerful tool to reactively use the NFP’s flexibility and performance.

  2. Latency: By offloading eBPF, latency is significantly reduced due to packets not having to cross the PCIe boundary. This can improve network hygiene for load balancing or for NAT use cases. Note that by avoiding the PCIe boundary, there is also a significant benefit in the DDoS prevention case as packets no longer cross the boundary, which could otherwise form the bottleneck under a well constructed DDoS attack.


    Figure 6. Latency of offloaded XDP vs. XDP in the driver; note the consistency in latency when using offload

  3. Interface to program your SmartNIC datapath: By being able to program a SmartNIC using eBPF, it means that it is extremely easy to implement features such as rate limiting, packet filtering, bad actor mitigation or other features that traditional NICs would have to implement in silicon. This can be customized for the end user’s specific use case.
OK, this is all great, but how do I actually use it?

The first thing to do is to update your kernel to 4.16 or above. I would recommend 4.17 (the development version as of this writing) or above to take advantage of as many features as possible. See the linked user guide for the features in the latest version and for examples of how to use them.

My colleague, David Beckett, has done a couple of great videos showing how to use XDP, find the one about the general use case here.

Without entering into the details here, it can also be noted that the tooling related to the eBPF workflow is under active development and has already been greatly improved compared to the legacy cBPF version. Users would now typically write programs in C and compile them with the back-end offered by clang-LLVM into eBPF bytecode. Other languages, including Go, Rust or Lua, are available too. Support for the eBPF architecture was added to traditional tools: llvm-objdump can dump eBPF bytecode in a human-readable format, llvm-mc can be used as an eBPF assembler, and strace can trace calls to the bpf() system call. Some work is still in progress for other tools: the binutils disassembler should support NFP microcode soon, and valgrind is about to get support for the system call. New tools have been created as well: bpftool, in particular, is exceptionally useful for introspection and simple management of eBPF objects.

bpfilter
A key question which is still outstanding at this point for the enterprise sysadmin or IT architect is how all of this applies to the end user who has a setup that has been perfected and maintained for years, and that is based upon iptables. The risk of changing such a setup is obviously hard to countenance: layers of orchestration would have to be modified, new APIs would have to be built, and so on. To solve this problem, enter the proposed bpfilter project. As Quentin wrote earlier this year:

“This technology is a proposal for a new eBPF-based back-end for the iptables firewall in Linux. As of this writing, it is in a very early stage: it was submitted as a RFC (Request for Comments) on the Linux netdev mailing list around mid-February 2018 by David Miller (maintainer of the networking system), Alexei Starovoitov and Daniel Borkmann (maintainers of the BPF parts in the kernel). So, keep in mind that all details that follow could change, or could not ever reach the kernel at all! 

Technically, the iptables binary used to configure the firewall would be left untouched, while the xtables part in the kernel could be transparently replaced by a new set of commands that would require the BPF sub-system to translate the firewalling rules into an eBPF program. This program could then be attached to one of the network-related kernel hooks, such as on the traffic control interface (TC) or at the driver level (XDP). Rule translation would occur in a new kind of kernel module that would be something between traditional modules and a normal ELF user space binary. Running in a special thread with full privilege but no direct access to the kernel, thus providing less attack surface, this special kind of module would be able to communicate directly with the BPF sub-system (mostly through system calls). And at the same time, it would remain very easy to use standard user space tools to develop, debug or even fuzz it! Besides this new module object, the benefits from the bpfilter approach could be numerous. Increased security is expected, thanks to the eBPF verifier. Reusing the BPF sub-system could possibly make maintenance of this component easier than for the legacy xtables and could possibly provide later integration with other components of the kernel that also rely on BPF. And of course, leveraging just-in-time (JIT) compiling, or possibly hardware offload of the program would enable a drastic improvement in performance!”

bpfilter is being developed as a solution to this problem. It allows the end user to seamlessly move to this new, high-performance paradigm. Just see below: comparing a simple series of iptables rules running on eight host CPU cores under the legacy iptables (netfilter) back-end, the newer nftables, bpfilter on the host, and bpfilter offloaded to the SmartNIC clearly shows where the performance lies.

Figure 7. Performance comparison of bpfilter vs older iptables implementations
An example video from David of how to implement this can be found here.

Summary

So there we have it. What is being produced within the kernel community as it stands is a massively powerful shift in networking. eBPF is a powerful tool that brings programmability to the kernel. It can deal with congestion control (TCP-BPF), tracing (kprobes, tracepoints) and high-performance networking (XDP, cls_bpf). Other use cases are likely to appear as a result of its success among the community. Beyond this, the transition extends to the end user, who will soon be able to seamlessly leave the old iptables back-end in favor of a newer and much more efficient XDP-based back-end—using the same tools as today. In particular, this will allow for straightforward hardware offload, and provide the necessary flexibility as users move to 10G and above networks.

Thanks for reading! Any insights belong to my colleagues, any errors are mine… Feel free to drop me an email if you have any further questions at nick.viljoen@netronome.com and cc oss-drivers@netronome.com.

Other useful links

Netdev 1.2, eBPF/XDP hardware offload to SmartNICs, Jakub Kicinski & Nic Viljoen: Video, Slides, Paper.
Netdev 2.2, Comprehensive XDP offload: handling the edge cases, Jakub Kicinski & Nic Viljoen: Video, Slides, Paper.
Quentin’s personal blog contains a great set of links towards additional eBPF and XDP resources.

Python API for Zero Day Phishing Detection Based on Computer Vision


README.rst

Summary

Official Python API for the Phish.AI public and private API to detect zero-day phishing websites

How it Works (TLDR)

Essentially, we have a very big computer-vision database of known websites and their legitimate domains. The API surfs to a given website, takes a screenshot of it, and then compares it with our database; if we detect that it is similar to a known website but hosted on a different domain, we classify it as malicious and identify the targeted brand (the website this site tries to mimic).

The engine is in beta and doesn't protect all brands yet. We make the database bigger every day; if you believe your brand is not in our database and you want us to crawl it, just drop me a line at yp@phish.ai.

Privacy Policy

The full privacy policy is at: https://www.phish.ai/phish-ai-privacy-policy/. By using the Public API you agree to our Privacy Policy and allow us to share your submission with the security community. If you want a Private API Key please contact us at info@phish.ai.

Installation

$ pip install phish-ai-api

Usage

from __future__ import print_function
from phish_ai_api import API

ph = API(api_key='None or private api key you can request at info@phish.ai')
res = ph.scan_url('https://google.com')
print(res)
print(ph.get_report(res['scan_id']))

Output

{"scan_id": "pQz7bGMwxgzGboNyX8cy"}
{"domain": "google.com","ip_address": "74.125.124.113","iso_code": "US","status": "completed","target": "Google","time": "2018-04-15T07:27:37.860Z","title": "google","tld": "com","url": "http://google.com","user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/67.0.3391.0 Safari/537.36","user_email": "api","verdict": "clean"}

Issues & Contributing

Found a bug or have a feature request? Feel free to open an issue and we will look into it. Cheers.

The illusion of time


The Order of Time, by Carlo Rovelli (Allen Lane, 2018)

According to theoretical physicist Carlo Rovelli, time is an illusion: our naive perception of its flow doesn’t correspond to physical reality. Indeed, as Rovelli argues in The Order of Time, much more is illusory, including Isaac Newton’s picture of a universally ticking clock. Even Albert Einstein’s relativistic space-time — an elastic manifold that contorts so that local times differ depending on one’s relative speed or proximity to a mass — is just an effective simplification.

So what does Rovelli think is really going on? He posits that reality is just a complex network of events onto which we project sequences of past, present and future. The whole Universe obeys the laws of quantum mechanics and thermodynamics, out of which time emerges.

Rovelli is one of the creators and champions of loop quantum gravity theory, one of several ongoing attempts to marry quantum mechanics with general relativity. In contrast to the better-known string theory, loop quantum gravity does not attempt to be a ‘theory of everything’ out of which we can generate all of particle physics and gravitation. Nevertheless, its agenda of joining up these two fundamentally differing laws is incredibly ambitious.

Alongside and inspired by his work in quantum gravity, Rovelli puts forward the idea of ‘physics without time’. This stems from the fact that some equations of quantum gravity (such as the Wheeler–DeWitt equation, which assigns quantum states to the Universe) can be written without any reference to time at all.

As Rovelli explains, the apparent existence of time — in our perceptions and in physical descriptions, written in the mathematical languages of Newton, Einstein and Erwin Schrödinger — comes not from knowledge, but from ignorance. ‘Forward in time’ is the direction in which entropy increases, and in which we gain information.

The book is split into three parts. In the first, “The Crumbling of Time”, Rovelli attempts to show how established physics theories deconstruct our common-sense ideas. Einstein showed us that time is just a fourth dimension and that there is nothing special about ‘now’; even ‘past’ and ‘future’ are not always well defined. The malleability of space and time mean that two events occurring far apart might even happen in one order when viewed by one observer, and in the opposite order when viewed by another.

Rovelli gives good descriptions of the classical physics of Newton and Ludwig Boltzmann, and of modern physics through the lenses of Einstein and quantum mechanics. There are parallels with thermodynamics and Bayesian probability theory, which both rely on the concept of entropy, and might therefore be used to argue that the flow of time is a subjective feature of the Universe, not an objective part of the physical description.

But I quibble with the details of some of Rovelli’s pronouncements. For example, it is far from certain that space-time is quantized, in the sense of space and time being packaged in minimal lengths or periods (the Planck length or time). Rather, our understanding peters out at those very small intervals for which we need both quantum mechanics and relativity to explain things.

In part two, “The World without Time”, Rovelli puts forward the idea that events (just a word for a given time and location at which something might happen), rather than particles or fields, are the basic constituents of the world. The task of physics is to describe the relationships between those events: as Rovelli notes, “A storm is not a thing, it’s a collection of occurrences.” At our level, each of those events looks like the interaction of particles at a particular position and time; but time and space themselves really only manifest out of their interactions and the web of causality between them.

In the final section, “The Sources of Time”, Rovelli reconstructs how our illusions have arisen, from aspects of thermodynamics and quantum mechanics. He argues that our perception of time’s flow depends entirely on our inability to see the world in all its detail. Quantum uncertainty means we cannot know the positions and speeds of all the particles in the Universe. If we could, there would be no entropy, and no unravelling of time. Rovelli originated this ‘thermal time hypothesis’ with French mathematician Alain Connes.

The Order of Time is a compact and elegant book. Each chapter starts with an apt ode from classical Latin poet Horace — I particularly liked “Don’t attempt abstruse calculations”. And the writing, translated from Italian by Erica Segre and Simon Carnell, is more stylish than that in most physics books. Rovelli ably brings in the thoughts of philosophers Martin Heidegger and Edmund Husserl, sociologist Émile Durkheim and psychologist William James, along with physicist-favourite philosophers such as Hilary Putnam and Willard Van Orman Quine. Occasionally, the writing strays into floweriness. For instance, Rovelli describes his final section as “a fiery magma of ideas, sometimes illuminating, sometimes confusing”.

Ultimately, I’m not sure I buy Rovelli’s ideas, about either loop quantum gravity or the thermal time hypothesis. And this book alone would not give a lay reader enough information to render judgement. The Order of Time does, however, raise and explore big issues that are very much alive in modern physics, and are closely related to the way in which we limited beings observe and participate in the world.


Facebook has auto-enrolled users into a facial recognition test in Europe


Facebook users in Europe are reporting that the company has begun testing its controversial facial recognition technology in the region.

Jimmy Nsubuga, a journalist at Metro, is among several European Facebook users who have said they’ve been notified by the company they are in its test bucket.

The company has previously said an opt-in option for facial recognition will be pushed out to all European users next month. It’s hoping to convince Europeans to voluntarily allow it to expand its use of the privacy-hostile tech — which was turned off in the bloc after regulatory pressure, back in 2012, when Facebook was using it for features such as automatically tagging users in photo uploads.

Under impending changes to its T&Cs — ostensibly to comply with the EU’s incoming GDPR data protection standard — the company has crafted a manipulative consent flow that tries to sell people on giving it their data, including filling in its own facial recognition blanks by convincing Europeans to agree to it grabbing and using their biometric data after all.

Notably Facebook is not offering a voluntary opt-in to Europeans who find themselves in its facial recognition test bucket. Rather users are being automatically turned into its lab rats — and have to actively delve into the settings to say no.

In a notification to affected users, the company writes [emphasis ours]: “You control face recognition. This setting is on, but you can turn it off at any time, which applies to features we may add later.”

Not only is the tech turned on, but users who click through to the settings to try and turn it off will also find Facebook attempting to dissuade them from doing that — with manipulative examples of how the tech can “protect” them.

As another Facebook user who found herself enrolled in the test — journalist Jennifer Baker — points out, what it’s doing here is incredibly disingenuous because it’s using fear to try to manipulate people’s choices.

Under the EU’s incoming data protection framework Facebook will not be able to automatically opt users into facial recognition — it will have to convince people to switch the tech on themselves.

But the experiment it’s running here (without gaining individuals’ upfront consent) looks very much like a form of A/B testing — to see which of its disingenuous examples is best able to convince people to accept the highly privacy-hostile technology by voluntarily switching it on.

But given that Facebook controls the entire consent flow, and can rely on big data insights gleaned from its own platform (of 2BN+ users), this is not even remotely a fair fight.

Consent is being manipulated, not freely given. This is big data-powered mass manipulation of human decisions, repeated until the ‘right’ answer (for Facebook’s business) is ‘selected’ by the user.

Data protection experts we spoke to earlier this week do not believe Facebook’s approach to consent will be legal under GDPR. Legal challenges are certain at this point.

But legal challenges also take time. And in the meantime Facebook users are being manipulated into agreeing to things that align with the company’s data-harvesting business interests — and handing over their sensitive personal information without understanding the full implications.

It’s also not clear how many Facebook users are being auto-enrolled into this facial recognition test — we’ve put questions to it and will update this post with any reply.

Last month Facebook said it would be rolling out “a limited test of some of the additional choices we’ll ask people to make as part of GDPR”.

It also said it was “starting by asking only a small percentage of people so that we can be sure everything is working properly”, and further claimed: “[T]he changes we’re testing will let people choose whether to enable facial recognition, which has previously been unavailable in the EU.”

Facebook’s wording in those statements is very interesting — with no number put on how many people will be made into test subjects (though it is very clearly trying to play the experiment down; “limited test”, “small”) — so we simply don’t know how many Europeans are having their facial data processed by Facebook right now, without their upfront consent.

Nor do we know where in Europe all these test subjects are located. But it’s pretty likely the test contravenes even current EU data protection laws. (GDPR applies from May 25.)

Facebook’s description of its testing plan last month was also disingenuous as it implied users would get to choose to enable facial recognition. In fact, it’s just switching it on — saddling test subjects with the effort of opting out.

The company was likely hoping the test would not attract too much attention — given how much GDPR news is flowing through its PR channels, and how much attention the topic is generally sucking up — and we can see why now because it’s essentially reversed its 2012 decision to switch off facial recognition in Europe (made after the feature attracted so much blow-back), to grab as much data as it can while it can.

Millions of Europeans could be having their fundamental rights trampled on here, yet again. We just don’t know what the company actually means by “small”. (The EU has ~500M inhabitants — even 1%, a “small percentage”, of that would involve millions of people… )

Once again Facebook isn’t telling how many people it’s experimenting on.

Xz format inadequate for long-term archiving (2017)

Xz format inadequate for long-term archiving

Abstract

One of the challenges of digital preservation is the evaluation of data formats. It is important to choose well-designed data formats for long-term archiving. This article describes the reasons why the xz compressed data format is inadequate for long-term archiving and inadvisable for data sharing and for free software distribution. The relevant weaknesses and design errors in the xz format are analyzed and, where applicable, compared with the corresponding behavior of the bzip2, gzip and lzip formats. Key findings include: (1) safe interoperability among xz implementations is not guaranteed; (2) xz's extensibility is unreasonable and problematic; (3) xz is vulnerable to unprotected flags and length fields; (4) LZMA2 is unsafe and less efficient than the original LZMA; (5) xz includes useless features that increase the number of false positives for corruption; (6) xz shows inconsistent behavior with respect to trailing data; (7) error detection in xz is several times less accurate than in bzip2, gzip and lzip.

Disclosure statement: The author is also author of the lzip format.

1 Introduction

There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.
-- C.A.R. Hoare

Perfection is reached, not when there is no longer anything to add, but when there is no longer anything to take away.
-- Antoine de Saint-Exupéry

Both the xz compressed data format and its predecessor lzma-alone have serious design flaws. But while lzma-alone is a toy format lacking fundamental features, xz is a complex container format full of contradictions. For example, xz tries to appear as a very safe format by offering overkill check sequences like SHA-256 but, at the same time it fails to protect the length fields needed to decompress the data in the first place. These defects make xz inadequate for long-term archiving and reduce its value as a general-purpose compressed data format.

This article analyzes the xz compressed data format, that is to say, the way bits are arranged in xz compressed files and the consequences of such an arrangement. This article is about formats, not programs. In particular, this article is not about bugs in any compression tool. The fact that the xz reference tool (xz-utils) has had more bugs than bzip2 and lzip combined is mainly a consequence of the complexity and bad design of the xz format. Also, the uninformative error messages provided by the xz tool reflect the extreme difficulty of finding out what failed in case of corruption in a xz file.

This article started with a series of posts to the debian-devel mailing list [Debian], where it became clear that nobody had analyzed xz in any depth before adopting it in the Debian package format. The same unthinking adoption of xz seems to have happened in major free software projects, like GNU Coreutils and Linux. In my opinion, it is a mistake for any widely used project to become an early adopter of a new data format; it may cause a lot of trouble if any serious defect is later discovered in the format.

2 The reasons why the xz format is inadequate for long-term archiving

2.1 Xz is a container format

On Unix-like systems, where a tool is supposed to do one thing and do it well, compressed file formats are usually formed by the compressed data, preceded by a header containing the parameters needed for decompression, and followed by a trailer containing integrity information. Bzip2, gzip and lzip formats are designed this way, minimizing both overhead and false positives.

On the contrary, xz is a container format which currently contains another container format (LZMA2), which in turn contains an arbitrary mix of LZMA data and uncompressed data. In spite of implementing just one compression algorithm, xz already manages 3 levels of headers, which increases its fragility.

The xz format has more overhead than bzip2, gzip or lzip, most of it either not properly designed (e.g., unprotected headers) or plain useless (padding). In fact, a xz stream can contain such a large amount of overhead that the format designers deemed it necessary to compress the overhead using unsafe methods.

There is no reason to use a container format for a general-purpose compressor. The right way of implementing a new compression algorithm is to provide a version number in the header, and the right way of implementing binary filters is to write a preprocessor that applies the filter to the data before feeding them to the compressor. (See for example mince).

2.2 Xz is fragmented by design

Xz was designed as a fragmented format. Xz implementations may choose what subset of the format they support. In particular, integrity checking in xz offers multiple choices of check types, all of them optional except CRC32, which is recommended. (See [Xz format], section 2.1.1.2 'Stream Flags'. See also [RFC 2119] for the definitions of 'optional' and 'recommended'). Safe interoperability among xz implementations is not guaranteed. For example the xz-embedded decompressor does not support the optional check types. Other xz implementations may choose to not support integrity checking at all.

The xz reference tool (xz-utils) ignores the recommendation of the xz format specification and uses by default an optional check type (CRC64) in the files it produces. This prevents decompressors that do not support the optional check types from verifying the integrity of the data. Using --check=crc32 when creating the file makes integrity checking work on the xz-embedded decompressor, but as CRC32 is just recommended, it does not guarantee that integrity checking will work on all xz compliant decompressors. Distributing software in xz format can only be guaranteed to be safe if the distributor controls the decompressor run by the user (or can force the use of external means of integrity checking). Error detection in the xz format is broken; depending on how the file was created and on what decompressor is available, the integrity check in xz is sometimes performed and sometimes not. The latter is usually the case for the tarballs released in xz format by GNU and Linux when they are decompressed with the xz-embedded decompressor (see the third xz test in [benchmark]).
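To make the fragmentation concrete, the following Python sketch reports which check type an xz file declares. It assumes the stream header layout described in [Xz format] (a 6-byte magic, 2 stream-flag bytes, and a CRC32 of those flags) and a check-type table taken from that specification; treat it as an illustration, not a validated tool.

import binascii
import struct
import sys

# Check-type IDs as listed in the xz format specification (assumed here):
# 0x00 = none, 0x01 = CRC32, 0x04 = CRC64, 0x0A = SHA-256.
CHECK_NAMES = {0x00: 'none', 0x01: 'CRC32', 0x04: 'CRC64', 0x0A: 'SHA-256'}
XZ_MAGIC = b'\xfd7zXZ\x00'

def xz_check_type(path):
    """Report the check type declared in the stream header of an xz file."""
    with open(path, 'rb') as f:
        header = f.read(12)              # magic (6) + stream flags (2) + CRC32 (4)
    if len(header) < 12 or header[:6] != XZ_MAGIC:
        raise ValueError('not an xz file')
    flags = header[6:8]
    (stored_crc,) = struct.unpack('<I', header[8:12])
    if binascii.crc32(flags) != stored_crc:
        raise ValueError('corrupt stream flags (CRC mismatch)')
    check_id = flags[1] & 0x0F           # low nibble of the second flag byte
    return CHECK_NAMES.get(check_id, 'unknown/reserved (0x%02x)' % check_id)

if __name__ == '__main__':
    for name in sys.argv[1:]:
        print(name, '->', xz_check_type(name))

Run against a tarball produced by xz-utils with default settings, a sketch like this would typically report CRC64, which the xz-embedded decompressor cannot verify.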

Fragmentation (subformat proliferation) hinders interoperability and complicates the management of large archives. The lack of guaranteed integrity checking increases the probability of undetected corruption. Bzip2, gzip and lzip are free from these defects; any decompressor can decompress and verify the integrity of any file in the corresponding format.

2.3 Xz is unreasonably extensible

The design of the xz format is based on the false idea that better compression algorithms can be mass-produced like cars in a factory. It has room for 2^63 filters, which can then be combined to make an even larger number of algorithms. Xz reserves less than 0.8% of filter IDs for custom filters, but even this small range provides about 8 million custom filter IDs for each human inhabitant on earth. There is not the slightest justification for such an egregious level of extensibility. Every useless choice allowed by a format takes space and makes corruption both more probable and more difficult to recover from.

The basic ideas of compression algorithms were discovered early in the history of computer science. LZMA is based on ideas discovered in the 1970s. Don't expect an algorithm much better than LZMA to appear anytime soon, much less several of them in a row.

In 2008 one of the designers of xz (Lasse Collin) warned me that lzip would become stuck with LZMA while others moved to LZMA2, LZMA3, LZMH, and other algorithms. Now xz-utils is usually unable to match the compression ratio of lzip because LZMA2 has more overhead than LZMA and, as expected, no new algorithms have been added to xz-utils.

2.4 Xz's extensibility is problematic

The xz format lacks a version number field. The only reliable way of knowing if a given version of a xz decompressor can decompress a given file is by trial and error. The 'file' utility does not provide any help:

$ file COPYING.*
COPYING.lz: lzip compressed data, version: 1
COPYING.xz: XZ compressed data

Xz-utils can report the minimum version of xz-utils required to decompress a given file, but it must decode each block header in the file to find it out, and it can only report older versions of xz-utils. If a newer version of xz-utils is required, it can't be known which one. The report is also useless to know what version of other decompressors (for example 7-zip) could decompress the file. Note that the version reported may be unable to decompress the file if xz-utils was built without support for some feature present in the file.

The extensibility of bzip2 and lzip is better. Both formats provide a version field. Therefore it is trivial for them to seamlessly and reliably incorporate a new compression algorithm while making clear what version of the tool is required to decompress a given file; tool_version >= file_version. If an algorithm much better than LZMA is found, a version 2 lzip format (perfectly fit to the new algorithm) can be designed, along with a version 2 lzip tool able to decompress the old and new formats transparently. Bzip2 is already a "version 2" format. The reason why bzip2 does not decompress bzip files is that the original bzip format was abandoned because of problems with software patents.
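By way of contrast, checking whether a given lzip file can be decompressed is a one-byte lookup. The following Python sketch assumes the member header layout suggested by the 'file' output above (a 4-byte 'LZIP' magic followed by a one-byte version number); the SUPPORTED_VERSION constant is hypothetical.

def lzip_required_version(path):
    """Return the format version stored in an lzip member header."""
    with open(path, 'rb') as f:
        header = f.read(5)               # magic 'LZIP' (4) + version (1), assumed layout
    if len(header) < 5 or header[:4] != b'LZIP':
        raise ValueError('not an lzip file')
    return header[4]

SUPPORTED_VERSION = 1                    # hypothetical: highest version this tool decodes

version = lzip_required_version('COPYING.lz')
if version <= SUPPORTED_VERSION:
    print('decompressible, format version', version)
else:
    print('needs a newer tool, format version', version)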

The extensibility of gzip is obsolete mainly because of the 32-bit uncompressed size (ISIZE) field.

2.5 Xz fails to protect the length of variable size fields

According to [Koopman] (p. 50), one of the "Seven Deadly Sins" (i.e., bad ideas) of CRC and checksum use is failing to protect a message length field. This causes vulnerabilities due to framing errors. Note that the effects of a framing error in a data stream are more serious than what Figure 1 suggests. Not only are data at a random position interpreted as the CRC; whatever data follow the bogus CRC will be interpreted as the beginning of the following field, preventing the successful decoding of any remaining data in the stream.

Figure 1. Corruption of message length field. Source: [Koopman], p. 30.

Except for the 'Backward Size' field in the stream footer, none of the many length fields in the xz format is protected by a check sequence of any kind. Not even a parity bit. All of them suffer from the framing vulnerability illustrated in the picture above. In particular, every LZMA2 header contains one 16-bit unprotected length field. Some length fields in the xz format are of variable size themselves, adding a new failure mode to xz not found in the other three formats: a double framing error.

Bzip2 is affected by this defect to a lesser extent; it contains two unprotected length fields in each block header. Gzip may be considered free from this defect because its only top-level unprotected length field (XLEN) can be validated using the LEN fields in the extra subfields. Lzip is free from this defect.

Optional fields are just as unsafe as unprotected length fields if the flag that indicates the presence of the optional field is itself unprotected. The result is the same; framing errors. Again, except the 'Stream Flags' field, none of those flags in the xz format is protected by a check sequence. In particular the critically important 'Block Flags' field in block headers and bit 6 in the control byte of the numerous LZMA2 headers are not protected.

Bzip2 contains 16 unprotected flags for optional Huffman bitmaps in each block header. Gzip just contains one byte with four unprotected flags for optional fields in its header. Lzip is free from optional fields.
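The framing vulnerability is easy to reproduce with a toy record format (purely illustrative; this is not the actual xz layout): each record is an unprotected 16-bit length followed by that many payload bytes, and a single bit flip in one length field misframes every record that follows.

import struct

def pack_records(payloads):
    # Toy format: each record is a 2-byte little-endian length plus the payload.
    return b''.join(struct.pack('<H', len(p)) + p for p in payloads)

def parse_records(blob):
    records, pos = [], 0
    while pos + 2 <= len(blob):
        (length,) = struct.unpack_from('<H', blob, pos)
        records.append(blob[pos + 2:pos + 2 + length])
        pos += 2 + length
    return records

good = pack_records([b'alpha', b'bravo', b'charlie'])
bad = bytearray(good)
bad[0] ^= 0x08                        # one bit flip in the first (unprotected) length field

print(parse_records(good))            # [b'alpha', b'bravo', b'charlie']
print(parse_records(bytes(bad)))      # misframed: every following record is read wrongly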

2.6 Xz uses variable-length integers unsafely

Xz stores many (potentially large) numbers using a variable-length representation terminated by a byte with the most significant bit (msb) cleared. In case of corruption, not only may the value of the field become incorrect, but the size of the field may also change, causing a framing error in the following fields. Xz uses such variable-length integers to store the size of other fields. In case of corruption in the size field, both the position and the size of the target field may become incorrect, causing a double framing error. See for example [Xz format], section 3.1.5 'Size of Properties' in 'List of Filter Flags'. Bzip2, gzip and lzip store all fields representing numbers in a safe fixed-length representation.
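Here is a minimal sketch of the msb-terminated encoding described above (7 payload bits per byte, a set msb meaning "more bytes follow"). Flipping the msb of a continuation byte changes not only the value but also where the field ends, so every field after it is misread; the example bytes are made up for illustration.

def decode_varint(data, pos=0):
    """Decode an msb-terminated variable-length integer; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        value |= (byte & 0x7F) << shift
        pos += 1
        if byte & 0x80 == 0:              # msb clear: last byte of the integer
            return value, pos
        shift += 7

encoded = bytes([0xC8, 0x03]) + b'following fields'            # encodes 456 in two bytes
print(decode_varint(encoded))             # (456, 2): the next field starts at offset 2

corrupted = bytes([0xC8 & 0x7F, 0x03]) + b'following fields'   # msb of first byte flipped
print(decode_varint(corrupted))           # (72, 1): wrong value AND wrong field boundary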

Xz features a monolithic index that is especially vulnerable to cascading framing errors. Some design errors of the xz index are:

  1. The number of records is coded as an unprotected variable-length integer vulnerable to double framing error.
  2. The size of the index is not stored anywhere. It must be calculated by decoding the whole index and can't be verified. ('Backward Size' stores the size of the index rounded up to the next multiple of four bytes, not the real size).
  3. When reading from unseekable sources, it delays the verification of the block sizes until the end of the stream and requires a potentially huge amount of RAM (up to 16 GiB), unless such verification is made by hashing, in which case it can't be known what blocks failed the test. The safe and efficient way is to verify the sizes of each block as soon as it is processed, as gzip and lzip do.
  4. The list of records is made of variable-length integers concatenated together. Regarding corruption it acts as one potentially very long unprotected variable-length integer. Just one bit flip in the msb of any byte causes the remaining records to be read incorrectly. It also causes the size of the index to be calculated incorrectly, losing the position of the CRC32 and the stream footer.
  5. Each record stores the size (not the position) of the corresponding block, but xz's block headers do not provide an identification string that could validate the block size. Therefore, just one bit flip in any 'Unpadded Size' field causes the positions of the remaining blocks to be calculated incorrectly. By contrast, lzip provides a distributed index where each member size is validated by the presence of the ID string in the corresponding member header. Neither the bzip2 format nor the gzip format provides an index.

2.7 LZMA2 is unsafe and less efficient than the original LZMA

The xz-utils manual says that LZMA2 is an updated version of LZMA to fix some practical issues of LZMA. This wording suggests that LZMA2 is some sort of improved LZMA algorithm. (After all, the 'A' in LZMA stands for 'algorithm'). But LZMA2 is a container format that divides LZMA data into chunks in an unsafe way. In practice, for compressible data, LZMA2 is just LZMA with 0.015%-3% more overhead. The maximum compression ratio of LZMA is about 7051:1, but LZMA2 is limited to 6843:1 approximately.

The [LZMA2 format] contains an arbitrary mix of LZMA packets and uncompressed data packets. Each packet starts with a header that is not protected by any check sequence in spite of containing the type and size of the following data. Therefore, every bit flip in a LZMA2 header causes either a framing error or a desynchronization of the decoder. In any case it is usually not possible to decode the remaining data in the block or even to know what failed. Compare this with [Deflate] which at least does protect the length field of its non-compressed blocks. (Deflate's compressed blocks do not have a length field).

Note that of the 3 levels of headers in a xz file (stream, block, LZMA2), the most numerous LZMA2 headers are the ones not protected by a check sequence. There is usually one stream header and one block header in a xz file, but there is at least one LZMA2 header for every 64 KiB of LZMA2 data in the file. In extreme cases the LZMA2 headers can make up to 3% of the size of the file:

-rw-r--r-- 1 14208 Oct 21 17:26 100MBzeros.lz
-rw-r--r-- 1 14195 Oct 21 17:26 100MBzeros.lzma
-rw-r--r-- 1 14676 Oct 21 17:26 100MBzeros.xz

The files above were produced by lzip (.lz) and xz-utils (.lzma, .xz). The LZMA stream is identical in the .lz and .lzma files above; they just differ in the header and trailer. The .xz file is larger than the other two mainly because of the 50 LZMA2 headers it contains. LZMA2 headers make xz both more fragile and less efficient (see the xz tests in [benchmark]). Additionally, corruption in the uncompressed packets of a LZMA2 stream can't be detected by the decoder, leaving the check sequence as the only way of detecting errors there.

On the other hand, the original LZMA data stream provides embedded error detection. Any distance larger than the dictionary size acts as a forbidden symbol, allowing the decoder to detect the approximate position of errors, and leaving very little work for the check sequence in the detection of errors.

LZMA2 could have been safer and more efficient if only its designers had copied the structure of Deflate; terminate compressed blocks with a marker, and protect the length of uncompressed blocks. This would have reduced the overhead, and therefore the number of false positives, in the files above by a factor of 25. For compressible files, that only need a header and a marker, the improvement is usually of 8 times less overhead per mebibyte of compressed size (about 500 times less overhead for a file of 64 MiB).

2.8 The 4 byte alignment is unjustified

Xz is the only format of the four considered here whose parts are (arbitrarily) aligned to a multiple of four bytes. The size of a xz file must also be a multiple of four bytes for no reason. To achieve this, xz includes padding everywhere; after headers, blocks, the index, and the whole stream. The bad news is that if the (useless) padding is altered in any way, "the decoder MUST indicate an error" according to the xz format specification.

Neither gzip nor lzip include any padding. Bzip2 includes a minimal amount of padding (at most 7 bits) at the end of the whole stream, but it ignores any corruption in the padding.

Xz justifies alignment as being perhaps able to increase speed and compression ratio (see [Xz format], section 5.1 'Alignment'), but such increases can't happen because:

  1. The only last filter in xz is LZMA2, whose output does not need any alignment.
  2. The output of the non-last filters in the chain is not stored in the file. Therefore it can't be "later compressed with an external compression tool" as stated in the xz format specification.

One additional problem of the xz alignment is that four bytes are not enough; the IA64 filter has an alignment of 16 bytes. Alignment is a property of each filter that can only be managed by the archiver, not a property of the whole compressed stream. Even the xz format specification recognizes that alignment of input data is the job of the archiver, not of the compressor.

The conclusion is that the 4 byte alignment is a misfeature that wastes space, increases the number of false positives for corruption, and worsens the burst error detection in the stream footer without producing any benefit at all.

2.9 Trailing data

If you want to create a compressed file and then append some data to it, for example a cryptographically secure hash, xz won't allow you to do so. The xz format specification forbids the appending of data to a file, except what it defines as 'stream padding'. In addition to telling you what you can't do with your files, defining stream padding makes xz show inconsistent behavior with respect to trailing data. Xz accepts the addition of any multiple of 4 null bytes to a file. But if the number of null bytes appended is not a multiple of 4, or if any of the bytes is non-null, then the decoder must indicate an error.

A format that reports as corrupt the only surviving copy of an important file just because cp had a glitch and appended some garbage at the end of the file is not well suited for long-term archiving. The worst thing is that the xz format specification does not offer any compliant way of ignoring such trailing data. Once a xz file gets any trailing data appended, it must be manually removed to make the file compliant again.

In a vain attempt to avoid such inconsistent behavior, xz-utils provides the option '--single-stream', which is just plain wrong for multi-stream files because it makes the decompressor ignore everything beyond the first stream, discarding any remaining valid streams and silently truncating the decompressed data:

cat file1.xz file2.xz file3.sig > file.xz
xz -d file.xz                         # indicates an error
xz -d --single-stream file.xz         # causes silent data loss
xz -kd --single-stream file.xz        # causes silent truncation

The '--single-stream' option violates the xz format specification which requires the decoder to indicate an error if the stream padding does not meet its requirements. The xz format should provide a compliant way to ignore any trailing data after the last stream, just like bzip2, gzip and lzip do by default.

2.10 Xz's error detection has low accuracy

"There can be safety tradeoffs with the addition of an error-detection scheme. As with almost all fault tolerance mechanisms, there is a tradeoff between availability and integrity. That is, techniques that increase integrity tend to reduce availability and vice versa. Employing error detection by adding a check sequence to adataword increases integrity, but decreases availability. The decrease in availability happens through false-positive detections. These failures preclude the use of some data that otherwise would not have been rejected had it not been for the addition of error-detection coding". ([Koopman], p. 33).

But the tradeoff between availability and integrity is different for data transmission than for data archiving. When transmitting data, usually the most important consideration is to avoid undetected errors (false negatives for corruption), because a retransmission can be requested if an error is detected. Archiving, on the other hand, usually implies that if a file is reported as corrupt, "retransmission" is not possible. Obtaining another copy of the file may be difficult or impossible. Therefore accuracy (freedom from mistakes) in the detection of errors becomes the most important consideration.

Two error models have been used to measure the accuracy in the detection of errors. The first model consists of one or more random bit flips affecting just one byte in the compressed file. The second model consists of zeroed 512-byte blocks aligned to a 512-byte boundary, simulating a whole sector I/O error. Just one zeroed block per trial. The first model is considered the most important because bit flips happen even in the most expensive hardware [MSL].

Verification of data integrity in compressed files is different from other cases (like Ethernet packets) because the data that can become corrupted are the compressed data, but the data that are verified (the dataword) are the decompressed data. Decompression can cause error multiplication; even a single-bit error in the compressed data may produce any random number of errors in the decompressed data, or even modify the size of the decompressed data.

Because of the error multiplication caused by decompression, the error model seen by the check sequence is one of unconstrained random data corruption. (Remember that the check sequence verifies the integrity of the decompressed data). This means that the choice of error-detection code (CRC or hash) is largely irrelevant, and that the probability of an error being undetected by the check sequence (Pudc) is 1 / (2^n) for a check sequence of n bits. (See [Koopman], p. 5). Note that if some errors do not produce error multiplication, a CRC is then preferable to a hash of the same size because of the burst error detection capabilities of the CRC.

Decompression algorithms are usually able to detect some errors in the compressed data (for example a backreference to a point before the beginning of the data). Therefore, the total probability of an undetected error (Pud = false negative) is the product of the probability of the error being undetected by the decoder (Pudd) and the probability of the error being undetected by the check sequence (Pudc): Pud = Pudd * Pudc.

It is also possible that a small error in the compressed data does not alter at all the decompressed data. Therefore, for maximum availability, only the decompressed data should be tested for errors. Testing the compressed data beyond what is needed to perform the decompression increases the number of false positives much more than it can reduce the number of undetected errors.

Of course, error multiplication was not applied in the analysis of fields that are not compressed, for example 'Block Header'. Burst error detection was also considered for the 'Stream Flags' and 'Stream Footer' fields.

Trial decompressions were performed using the 'unzcrash' tool included in the lziprecover package.

The following sections describe the places in the xz format where error detection suffers from low accuracy and explain the cause of the inaccuracy in each case.

2.10.1 The 'Stream Flags' field

A well-known property of CRCs is their ability to detect burst errors up to the size of the CRC itself. Using a CRC larger than the dataword is an error because a CRC just as large as the dataword equally detects all errors while it produces a lower number of false positives.

In spite of the mathematical property described above, the 16-bit 'Stream Flags' field in the xz stream header is protected by a CRC32 twice as large as the field itself, providing an unreliable error detection where 2 of every 3 reported errors are false positives. The inaccuracy reaches 67%. CRC16 is a better choice from any point of view. It can still detect all errors in 'Stream Flags', but produces half as many false positives as CRC32.

Note that a copy of the 'Stream Flags', also protected by a CRC32, is stored in the stream footer. With such an amount of redundancy, xz should be able to repair a fully corrupted 'Stream Flags'. Instead, the format specifies that if one of the copies, or one of the CRCs, or the backward size in the stream footer gets any damage, the decoder must indicate an error. The result is that getting a false positive for corruption related to the 'Stream Flags' is 7 times more probable than getting real corruption in the 'Stream Flags' themselves.
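A back-of-envelope check of those figures, assuming single-byte errors spread uniformly over the field plus its check sequence: a corrupted byte that lands inside the CRC is a false positive, because the 'Stream Flags' proper are still intact.

def false_positive_ratio(field_bytes, cs_bytes):
    # Fraction of single-byte errors (over field + check sequence) that hit only
    # the check sequence and therefore reject a file whose dataword is intact.
    return cs_bytes / (field_bytes + cs_bytes)

print(false_positive_ratio(2, 4))     # CRC32 over the 2-byte 'Stream Flags': ~0.67
print(false_positive_ratio(2, 2))     # CRC16 instead: 0.50, and only half as many
                                      # bytes whose corruption causes a false positive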

2.10.2 The 'Stream Footer' field

The 'Stream Footer' field contains the rounded up size of the index field and a copy of the 'Stream Flags' field from the stream header, both protected by a CRC32. The inaccuracy of the error detection for this field reaches 40%: 2 of every 5 reported errors are false positives.

The CRC32 in 'Stream Footer' provides a reduced burst error detection because it is stored at the front instead of the back of the codeword. (See [Koopman], p. C-20). Testing has found several undetected burst errors of 31 bits in this field, while a correctly placed CRC32 would have detected all burst errors up to 32 bits. The reason adduced by the xz format specification for this misplacement is to keep the four-byte fields aligned to a multiple of four bytes, but the 4 byte alignment is unjustified.

2.10.3 The 'Block Header' field

The 'Block Header' is of variable size. Therefore the inaccuracy of the error detection varies between 0.4% and 58%, and is usually 58% (7 of every 12 reported errors are false positives). As shown in the graph below, CRC16 would have been a more accurate choice for any size of 'Block Header'. But inaccuracy is a minor problem compared with the lack of protection of the 'Block Header Size' and 'Block Flags' fields.

Figure 2. Inaccuracy of block header CRC for all possible header sizes.

2.10.4 The 'Block Check' field

Xz supports several types of check sequences (CS) for the decompressed data; none, CRC32, CRC64 and SHA-256. Each check sequence provides better accuracy than the next larger one up to a certain compressed size. For the single-byte error model, the inaccuracy for each compressed size and CS size is calculated by the following formula (all sizes in bytes):

Inaccuracy = ( compressed_size * Pudc + CS_size ) / ( compressed_size + CS_size )

Applying the formula above shows that CRC32 provides more accurate error detection than CRC64 up to a compressed size of about 16 GiB, and more accurate than SHA-256 up to 112 GiB. It should be noted that SHA-256 provides worse accuracy than CRC64 for all possible block sizes.

Figure 3. Inaccuracy of block check up to 1 GB of compressed size.
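A short Python check of those crossover figures, using the formula above with Pudc = 1/2^n and single-byte errors (sizes in bytes); the binary search simply finds the smallest compressed size at which the larger check sequence stops being less accurate.

def inaccuracy(compressed_size, cs_bits):
    cs_size = cs_bits // 8
    pudc = 1.0 / 2 ** cs_bits
    return (compressed_size * pudc + cs_size) / (compressed_size + cs_size)

GiB = 1024 ** 3

def crossover(bits_a, bits_b, hi=1024 * GiB):
    """Smallest compressed size where check B is at least as accurate as check A."""
    lo = 1
    while lo < hi:
        mid = (lo + hi) // 2
        if inaccuracy(mid, bits_b) <= inaccuracy(mid, bits_a):
            hi = mid
        else:
            lo = mid + 1
    return lo

print(crossover(32, 64) / GiB)        # ~16: below this, CRC32 beats CRC64
print(crossover(32, 256) / GiB)       # ~112: below this, CRC32 beats SHA-256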

For the zeroed-block error model, the inaccuracy curves are similar to the ones in figure 3, except that they have discontinuities because a false positive can be produced only if the last block is suitably aligned.

The results above assume that the decoder does not detect any errors, but testing shows that, on large enough files, the Pudd of a pure LZMA decoder like the one in lzip is about 2.52e-7 for the single-byte error model. More precisely, 277.24 million trial decompressions on files ranging from 1 kB to 217 MB of compressed size resulted in 70 errors undetected by the decoder (all of them detected by the CRC). This additional detection capability reduces the Pud by the same factor. (In fact the reduction of Pud is larger because 9 of the 70 errors didn't cause error multiplication; they produced just one wrong byte in the decompressed data, which is guaranteed to be detected by the CRC). The estimated Pud for lzip, based on these data, is about 2.52e-7 * 2.33e-10 = 5.88e-17.
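The arithmetic behind that estimate, using the numbers just quoted (the observed decoder miss rate times the 1/2^32 miss rate of a 32-bit check sequence):

trials = 277.24e6                  # trial decompressions reported above
missed_by_decoder = 70             # errors the pure LZMA decoder did not catch

pudd = missed_by_decoder / trials  # ~2.52e-7
pudc = 1 / 2 ** 32                 # ~2.33e-10 for a 32-bit check sequence
print(pudd)                        # ~2.52e-07
print(pudd * pudc)                 # ~5.9e-17, the estimated Pud for lzip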

For the zeroed-block error model, the additional detection capability of a pure LZMA decoder is probably much larger. A LZMA stream is a check sequence in itself, and large errors seem less likely to escape detection than small ones. In fact, the lzip decoder detected the error in all the 2 million trial decompressions run with a zeroed block. The xz decoder can't achieve such performance because LZMA2 includes uncompressed packets, where the decoder can't detect any errors.

There is a good reason why bzip2, gzip, lzip and most other compressed formats use a 32-bit check sequence; it provides for an optimal detection of errors. Larger check sequences may (or may not) reduce the number of false negatives at the cost of always increasing the number of false positives. But significantly reducing the number of false negatives may be impossible if the number of false negatives is already insignificant, as is the case in bzip2, gzip and lzip files. On the other hand, the number of false positives increases linearly with the size of the check sequence. CRC64 doubles the number of false positives of CRC32, and SHA-256 produces 8 times more false positives than CRC32, decreasing the accuracy of the error detection instead of increasing it.

Increasing the probability of a false positive for corruption in the long-term storage of valuable data is a bad idea. This is why the lzip format, designed for long-term archiving, provides 3 factor integrity checking and the decompressor reports mismatches in each factor separately. This way if just one byte in one factor fails but the other two factors match the data, it probably means that the data are intact and the corruption just affects the mismatching check sequence. GNU gzip also reports mismatches in its 2 factors separately, but does not report the exact values, making it more difficult to tell real corruption from a false positive. Bzip2 reports separately its 2 levels of CRCs, allowing the detection of some false positives.

Being able to produce files without a check sequence for the decompressed data may help xz to rank higher on decompression benchmarks, but is a very bad idea for long-term archiving. The whole idea of supporting several check types is wrong. It fragments the format and introduces a point of failure in the xz stream; if the corruption affects the stream flags, xz won't be able to verify the integrity of the data because the type and size of the check sequence are lost.

2.11 Xz's error detection is misguided

Xz tries to detect errors in parts of the compressed file that do not affect decompression (for example in padding bytes), obviating the fact that nobody is interested in the integrity of the compressed file; it is the integrity of the decompressed data that matters. Note that the xz format specification sets more strict requirements for the integrity of the padding than for the integrity of the payload. The specification does not guarantee that the integrity of the decompressed data will be verified, but it mandates that the decompression must be aborted as soon as a damaged padding byte is found. (See sections 2.2, 3.1.6, 3.3 and 4.4 of [Xz format]). Xz goes so far as to "protect" padding bytes with a CRC32. This behavior of xz just causes unnecessary data loss.

Checking the integrity of the decompressed data is important because it not only guards against corruption in the compressed file, but also against memory errors, undetected bugs in the decompressor, etc.

The only reason to be concerned about the integrity of the compressed file itself is to be sure that it has not been modified or replaced with another file. But no amount of strictness in the decompressor can guarantee that a file has not been modified or replaced. Some other means must be used for this purpose, for example an external cryptographically secure hash of the file.

2.12 Xz does not provide any data recovery means

File corruption is an unlikely event. Being unable to restore a file because the backup copy is also damaged is even less likely. But unlikely events happen constantly to somebody somewhere. This is why tools like ddrescue, bzip2recover and lziprecover exist in the first place. Lziprecover defines itself as "a last line of defense for the case where the backups are also damaged".

The safer a format is, the easier it is to develop a capable recovery tool for it. Neither xz nor gzip provides any recovery tool. Bzip2 provides bzip2recover, which can help to manually assemble a correct file from the undamaged blocks of two or more copies. Lzip provides lziprecover, which can produce a correct file by merging the good parts of two or more damaged copies and can additionally repair slightly damaged files without the need of a backup copy.

3 Then, why do some free software projects use xz?

Because evaluating formats is difficult and most free software projects are not concerned about long-term archiving, or even about format quality. Therefore they tend to use the most advertised formats. Both lzma-alone and xz have gained some popularity in spite of their defects, mainly because they are associated with the popular 7-zip archiver.

This of course is sad because we software developers are among the few people who are able to understand the strengths and weaknesses of formats. We have a moral duty to choose wisely the formats we use because everybody else will blindly use whatever formats we choose.

4 Conclusions

There are several reasons why the xz compressed data format should not be used for long-term archiving, especially of valuable data. To begin with, xz is a complex container format. Using a complex format for long-term archiving would be a bad idea even if the format were well-designed, which xz is not. In general, the more complex the format, the less probable that it can be decoded in the future by a digital archaeologist. For long-term archiving, simple is robust.

Xz is fragmented by design. Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed, which makes the use of xz inadvisable not only for long-term archiving, but also for data sharing and for free software distribution. Xz is also unreasonably extensible; it has room for trillions of compression algorithms, but currently only supports one, LZMA2, which in spite of its name is not an improved version of LZMA, but an unsafe container for LZMA data. Such an egregious level of extensibility makes corruption both more probable and more difficult to recover from. Additionally, the xz format lacks a version number field, which makes xz's extensibility problematic.

Xz fails to protect critical fields like length fields and flags signalling the presence of optional fields. Xz uses variable-length integers unsafely, especially when they are used to store the size of other fields or when they are concatenated together. These defects make xz fragile, meaning that most of the time when it reports a false positive, the decoder state is so mangled that it is unable to recover the decompressed data.

Error detection in the xz format is less accurate than in the bzip2, gzip and lzip formats, mainly because of false positives, and especially if an overkill check sequence like SHA-256 is used in xz. Another cause of false positives is that xz tries to detect errors in parts of the compressed file that do not affect decompression, like the padding added to keep the useless 4 byte alignment. In total xz reports several times more false positives than bzip2, gzip or lzip, and every false positive may result in unnecessary loss of data.

All these defects and design errors reduce the value of xz as a general-purpose format because anybody wanting to archive a file already compressed in xz format will have to either leave it as-is and face a larger risk of losing the data, or waste time recompressing the data into a format more suitable for long-term archiving.

The weird combination of unprotected critical fields, overkill check sequences, and padding bytes "protected" by a CRC32 can only be explained by the inexperience of the designers of xz. It is said that given enough eyeballs, all bugs are shallow. But the adoption of xz by several GNU/Linux distributions shows that if those eyeballs lack the required experience, it may take too long for them to find the bugs. It would be an improvement for data safety if compressed data formats intended for broad use were designed by experts and peer reviewed before publication. This would help to avoid design errors like those of xz, which are very difficult to fix once a format is in use.

5 References

6 Glossary

  • ^: Exponentiation symbol.
  • Accuracy: Freedom from mistakes. Is defined as 1 - inaccuracy.
  • Bit flip: Error that inverts a single bit value.
  • Burst Error: A set of bit errors contained within a span shorter than the length of the codeword. The burst length is the distance between the first and last bit errors. Zero or more of the intervening bits may be erroneous.
  • Check sequence: An error-detecting code value stored with associated data. A dataword plus a check sequence makes up a codeword.
  • Codeword: A dataword combined with a check sequence.
  • Dataword: The data being protected by a check sequence. A dataword can be any number of bits and is independent of the machine word size (e.g., a dataword could be 64 Megabytes).
  • False negative: Undetected error producing incorrect decompressed data.
  • False positive: Inability or refusal to decompress the data in a damaged compressed file even if the compressed data proper remain decompressible. (i.e., either the compressed data have not suffered any damage or the damage does not hinder decompression).
  • Fragility: Inability to recover the decompressed data (for example decompressing to standard output) in case of a false positive.
  • Inaccuracy: Ratio of error detection mistakes. Is defined as ( false_negatives + false_positives ) / total_cases.
  • Overhead: Everything in the file except the compressed data proper. (Headers, check sequences, padding, etc).
  • Overkill: An unnecessary excess of whatever is needed to achieve a goal.
  • Pud: Probability of an undetected error (false negative). Pud is equal to Pudd * Pudc.
  • Pudc: Probability of an error undetected by the check sequence.
  • Pudd: Probability of an error undetected by the decoder.
  • Unsafe: 1 Likely to cause severe data loss even in case of the smallest corruption (a single bit flip). 2 Likely to produce false negatives.

Copyright © 2016 Antonio Diaz Diaz.

You are free to copy and distribute this article without limitation, but you are not allowed to modify it.

First published: 2016-06-11
Updated: 2017-07-03

Swedish DJ Avicii dies at 28

Image caption: DJ Avicii - real name Tim Bergling - died on Friday, his representative announced (Getty Images)

Swedish DJ Avicii, who has collaborated with the likes of Madonna and Coldplay, has died in Oman at the age of 28.

Avicii's club anthems include Wake Me Up, Hey Brother and, more recently, Lonely Together with Rita Ora.

His representative said in a statement: "It is with profound sorrow that we announce the loss of Tim Bergling, also known as Avicii.

"The family is devastated and we ask everyone to please respect their need for privacy in this difficult time."

No cause of death was announced, and Avicii's representative said no further statements would be issued.

Avicii had struggled with some health issues in the past, having his appendix and gall bladder removed in 2014.

He announced his retirement from touring in 2016, partly because of the health problems.

"I know I am blessed to be able to travel all around the world and perform, but I have too little left for the life of a real person behind the artist," he said at the time.

Who was Avicii?

  • One of the biggest names in dance music of the last 10 years, he had a catalogue full of pumping, uplifting, house smashes
  • He started his career when he won a production competition held by Pete Tong in 2008
  • He went on to notch up 11 billion streams on Spotify and was the first EDM DJ to stage a worldwide arena tour
  • He was nominated for two Grammy Awards and had nine UK top 10 singles, including two number ones
  • He suffered from health problems including acute pancreatitis, in part due to excessive drinking

He later announced a return to the studio, and released a new self-titled EP in 2017.

Other leading electronic artists wrote tributes to Bergling after the news of his death.

Music artist Dua Lipa tweeted: "Such sad news to hear about Avicii passing. Too young and way too soon. My condolences go out to his family, friends and fans."

CoinTracker (YC W18) Is Hiring Engineer #1

Hey HN!

CoinTracker is a portfolio & tax manager for cryptocurrency. We are looking for Engineer #1 to join our team in SF (https://www.cointracker.io/about). A few interesting stats:

+ We already have 30,000 connected exchange accounts tracking over $200M in crypto assets
+ We have a live product that is generating revenue (ramen profitable)
+ We are venture backed by Y Combinator, Initialized Capital (Garry Tan & Alexis Ohanian), the first seed investors in Coinbase, and others
+ We are technical, product-driven, and move fast

If you or someone you know is excited about making cryptocurrency accessible to more people, please chat with us!



Show HN: Faster.js – a micro-optimizing JavaScript compiler


README.md


faster.js is a Babel plugin that compiles idiomatic Javascript to faster, micro-optimized Javascript using ideas inspired by the fast.js library.

Installation

Setup Babel for your project if you haven't already. Then install faster.js:

npm install --save-dev faster.js

Usage

.babelrc
{"plugins": ["faster.js"]
}
Babel CLI
babel-cli --plugins faster.js script.js
webpack.config.js (Webpack 4)
module: {
  rules: [{
    test: /\.js$/,
    exclude: /(node_modules)/,
    use: {
      loader: 'babel-loader',
      options: {
        plugins: [require('faster.js')]
      }
    }
  }]
}

What faster.js does

faster.js rewrites common Array method calls to faster code that does the same thing (usually - see When NOT to use faster.js). This results in performance boosts (especially on code that relies heavily on Array methods) while maintaining code readability, but comes at the cost of a slightly larger bundle size. If having a small Javascript bundle size is much more important for you than performance is, you should not use faster.js.

Supported Array methods

faster.js will rewrite the following Array methods when possible:

  • .every()
  • .filter()
  • .forEach()
  • .map()
  • .reduce()
  • .reduceRight()
  • .some()

Demo


Try it yourself: https://fasterjs-demo.victorzhou.com

Demo Github repo: https://github.com/vzhou842/faster.js-demo

⚠️ When NOT to use faster.js

faster.js makes two critical assumptions that MUST be true about your codebase:

1. Sparse Arrays are never used.

Code compiled with faster.js may produce incorrect results when run on sparse arrays.

2. Restricted methods are only ever called on native Javascript arrays:

faster.js assumes any restricted method call is done on a native Javascript array. Any new classes you write should not include methods with restricted names.

Restricted method names are the names of methods that faster.js will attempt to rewrite - see Supported Array methods.

// OK
const a = [1, 2, 3].map(e => 2 * e);

// BAD
class Foo {
  constructor(map) {
    this._map = map;
  }
  map() {
    return this._map;
  }
}
const f = new Foo({});
const map = f.map(); // .map() is a restricted method

How faster.js works

faster.js exploits the fact that native Javascript Array methods are slowed down by having to support seldom-used edge cases like sparse arrays. Assuming no sparse arrays, there are often simple ways to rewrite common Array methods to improve performance.

Example: Array.prototype.map()

// Original code
const arr = [1, 2, 3];
const results = arr.map(e => 2 * e);

roughly compiles to

// Compiled with faster.js
const arr = [1, 2, 3];
const results = [];
const _f = (e => 2 * e);
for (let _i = 0; _i < arr.length; _i++) {
  results.push(_f(arr[_i], _i, arr));
}

Benchmarks

Example benchmark output (condensed)

$ npm run bench

  array-every large
    ✓ native x 2,255,548 ops/sec ±0.46% (57 runs sampled)
    ✓ faster.js x 10,786,892 ops/sec ±1.25% (56 runs sampled)
faster.js is 378.2% faster (0.351μs) than native

  array-filter large
    ✓ native x 169,237 ops/sec ±1.42% (55 runs sampled)
    ✓ faster.js x 1,110,629 ops/sec ±1.10% (59 runs sampled)
faster.js is 556.3% faster (5.008μs) than native

  array-forEach large
    ✓ native x 61,097 ops/sec ±3.66% (43 runs sampled)
    ✓ faster.js x 200,459 ops/sec ±0.52% (55 runs sampled)
faster.js is 228.1% faster (11.379μs) than native

  array-map large
    ✓ native x 169,961 ops/sec ±1.51% (57 runs sampled)
    ✓ faster.js x 706,781 ops/sec ±0.64% (59 runs sampled)
faster.js is 315.8% faster (4.469μs) than native

  array-reduce large
    ✓ native x 200,425 ops/sec ±1.01% (55 runs sampled)
    ✓ faster.js x 1,694,350 ops/sec ±1.52% (55 runs sampled)
faster.js is 745.4% faster (4.399μs) than native

  array-reduceRight large
    ✓ native x 49,784 ops/sec ±0.38% (58 runs sampled)
    ✓ faster.js x 1,756,352 ops/sec ±0.99% (59 runs sampled)
faster.js is 3428.0% faster (19.517μs) than native

  array-some large
    ✓ native x 2,968,367 ops/sec ±0.56% (56 runs sampled)
    ✓ faster.js x 11,591,773 ops/sec ±1.29% (54 runs sampled)
faster.js is 290.5% faster (0.251μs) than native

FAQ

What is a sparse array?

Sparse arrays are arrays that contain holes or empty slots.

const sparse1 = [0, , 1]; // a sparse array literal
console.log(sparse1.length); // 3

const sparse2 = [];
sparse2[5] = 0; // sparse2 is now a sparse array
console.log(sparse2.length); // 6

It is generally recommended to avoid using sparse arrays.


Wells Fargo Hit with $1B in Fines


The Consumer Financial Protection Bureau is levying a $1 billion fine against Wells Fargo in punishment for the banking giant's actions in its mortgage and auto loan businesses. (Spencer Platt/Getty Images)

The Consumer Financial Protection Bureau is levying a $1 billion fine against Wells Fargo, a record for the agency, in punishment for the banking giant's actions in its mortgage and auto loan businesses.

Wells Fargo's "conduct caused and was likely to cause substantial injury to consumers," the agency said in its filings about the bank.

Wells Fargo broke the law by charging some consumers too much over mortgage interest rate-lock extensions, as well as by running a mandatory insurance program that added insurance costs and fees into some borrowers' auto loans, the CFPB said.

Announcing the penalty on Friday, the CFPB said that it is part of a settlement with Wells Fargo, which has also pledged to repair the financial harm to consumers.

Because of the penalties, Wells Fargo says, it is adjusting its preliminary financial results for the first quarter of 2018, shifting $800 million in its balance sheet and dropping its net income for the quarter to $4.7 billion.

The new federal action against the bank comes less than two years after Wells Fargo was fined nearly $200 million over what the CFPB called "the widespread illegal practice of secretly opening unauthorized deposit and credit card accounts."

Those earlier penalties included a $100 million fine by the CFPB, a record at the time. The new punishment stems from the agency's findings that Wells Fargo abused its relationship with home and auto loan borrowers.

Wells Fargo was also punished by the U.S. Office of the Comptroller of the Currency over its risk management practices, with the agency collecting a $500 million penalty as part of the fines announced on Friday.

Along with treating its customers unfairly, the OCC said, Wells Fargo had failed to maintain a compliance risk management program that was appropriate for a bank of its size and complexity.

That failure, the OCC said, led Wells Fargo to "engage in reckless unsafe or unsound practices and violations of law."

Auto Loans, Insurance And Fees

Problems in the way the Wells Fargo auto loan unit handled consumers' accounts exposed people to hundreds or thousands of dollars in premiums and fees. The issues were also found to have possibly contributed to thousands of cars being repossessed.

The CFPB said that problems with the auto loan unit persisted for more than 10 years, from October of 2005 to September of 2016.

Lenders can require borrowers to maintain insurance on their vehicles, and if a borrower doesn't do that, there's a process that allows lenders to arrange for what's called Force-Placed Insurance and add that cost to the loan. But Wells Fargo acknowledged that of the roughly 2 million car loans that it put into that program, it "forcibly placed duplicative or unnecessary insurance on hundreds of thousands of those borrowers' vehicles."

For some borrowers, the bank also improperly maintained those force-placed policies on their accounts even after they secured adequate insurance.

The CFPB said, "If borrowers failed to pay the amounts [Wells Fargo] charged them for the Force-Placed Insurance, they faced additional fees and, in some instances, experienced delinquency, loan default, and even repossession."

In one five-year period from 2011 to 2016, Wells Fargo acknowledged in the settlement, the extra costs of force-placed insurance may have played a role in at least 27,000 customers having their vehicles repossessed.

Home Loan Rate Locks

Wells Fargo failed to follow its own policies in how it charged fees over locking in mortgage interest rates beyond the standard guaranteed window, the CFPB said, adding that the bank charged customers for the rate extension even in cases where the bank itself was the reason for delays in closing on a home loan.

The problems persisted for several years after the bank's internal audit identified the risk of harming consumers, according to the government's filing about the settlement.

Wells Fargo unfairly and inconsistently applied its policy on rate-extension fees in a period that ran from September of 2013 through February of 2017, the agency said.

In addition to the record settlement announced on Friday, Wells Fargo also faced a government reprimand in February, when the Federal Reserve took the rare step of restricting Wells Fargo's growth and demanding the replacement of four board members in response to "widespread consumer abuses and compliance breakdowns" at the bank, as NPR reported.

At the time, the Fed faulted Wells Fargo for maintaining a business strategy that prioritizes its own growth at the expense of risk management.

The Key to Everything

Johnny Miller/Unequal Scenes/Thomson Reuters Foundation: Ciudad Nezahualcóyotl, part of greater Mexico City, 2016

Geoffrey West spent most of his life as a research scientist and administrator at the Los Alamos National Laboratory, running programs concerned not with nuclear weapons but with peaceful physics. After retiring from Los Alamos, he became director of the nearby Santa Fe Institute, where he switched from physics to a broader interdisciplinary program known as complexity science. The Santa Fe Institute is leading the world in complexity science, with a mixed group of physicists, biologists, economists, political scientists, computer experts, and mathematicians working together. Their aim is to reach a deep understanding of the complexities of the natural environment and of human society, using the methods of science.

Scale is a progress report, summarizing the insights that West and his colleagues at Santa Fe have achieved. West does remarkably well as a writer, making a complicated world seem simple. He uses pictures and diagrams to explain the facts, with a leisurely text to put the facts into their proper setting, and no equations. There are many digressions, expressing personal opinions and telling stories that give a commonsense meaning to scientific conclusions. The text and the pictures could probably be understood and enjoyed by a bright ten-year-old or by a not-so-bright grandparent.

The title, Scale, needs some clarification. To explain what his book is about, West added the subtitle “The Universal Laws of Growth, Innovation, Sustainability, and the Pace of Life in Organisms, Cities, Economies, and Companies.” The title tells us that the universal laws the book lays down are scaling laws. The word “scale” is a verb meaning “vary together.” Each scaling law says that two measurable quantities vary together in a particular way.

We suppose that the variation of each quantity is expressed as a percentage rate of increase or decrease. The scaling law then says that the percentage rate for quantity A is a fixed number k times the percentage rate for quantity B. The number k is called the power of the scaling law. Since the percentage changes of A and B accumulate with compound interest, the scaling law says that A varies with the kth power of B, where now the word “power” has its usual mathematical meaning. For example, if a body is falling without air resistance, the scaling law between distance fallen and time has k=2. The distance varies with the square of time. You fall 16 feet in one second, 64 feet in two seconds, 144 feet in three seconds, and so on.
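
Restated in symbols (the notation here is mine, not West's): writing the percentage rates as differentials, the law reads

\[
\frac{dA}{A} = k\,\frac{dB}{B}
\quad\Longrightarrow\quad
\ln A = k \ln B + \mathrm{const}
\quad\Longrightarrow\quad
A \propto B^{k}.
\]

For the falling body, k = 2, so distance varies with the square of time; d = 16t^2 feet reproduces the numbers above.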

Another classic example of a scaling law is the third law of planetary motion, discovered by the astronomer Johannes Kepler in 1618. Kepler found by careful observation that the time it takes for a planet to orbit the sun scales with the three-halves power of the diameter of its orbit. That means that the square of the time is proportional to the cube of the distance. Kepler measured the periods and diameters of the orbits of the six planets known in his time, and found that they followed the scaling law precisely. Fifty-nine years later, Isaac Newton explained Kepler’s laws of planetary motion as consequences of a mathematical theory of universal gravitation. Kepler’s laws gave Newton the essential clues that led to the theoretical understanding of the physical universe.

There is a scaling law in biology as important as Kepler’s third law in astronomy. It ought to have the name of Motoo Kimura attached to it, since he was the first to understand its importance, but instead it is known as the law of genetic drift. Genetic drift is one of the two great driving forces of evolution, the other being natural selection. Darwin is rightly honored for his understanding of natural selection as a main cause of evolution, but he failed to include genetic drift in his picture because he knew nothing about genes.

Genetic drift is the change in the average composition of a population due to random mutations of individual genes. Genetic drift causes species to evolve even in the absence of selection. Genetic drift and natural selection work together to drive evolution, selection being dominant when populations are large, genetic drift being dominant when populations are small.

Genetic drift is particularly important for the formation of new species, when populations may remain small for a long time. The predominance of genetic drift for small populations is due to a simple scaling law. Genetic drift scales with the inverse square root of population. This means that genetic drift is ten times faster for a population of ten thousand than for a population of a million. The scaling is the same for any kind of random mutations. If we observe any measurable quantity such as height, running speed, age at puberty, or intelligence test score, the average drift will vary with the inverse square root of population. The square root results from the statistical averaging of random events.

West is now making a huge claim: that scaling laws similar to Kepler’s law and the genetic drift law will lead us to a theoretical understanding of biology, sociology, economics, and commerce. To justify this claim he has to state the scaling laws, display the evidence that they are true, and show how they lead to understanding. He does well with the first and second tasks, not so well with the third. The greater part of the book is occupied with stating the laws and showing the evidence. Little space is left over for explaining. The Santa Fe observers know how to play the part of a modern-day Kepler, but they do not come close to being a modern-day Newton.

The history of each branch of science can be divided into three phases. The first phase is exploration, to see what nature is doing. The second phase is precise observation and measurement, to describe nature accurately. The third phase is explanation, to build theories that enable us to understand nature. Physics reached the second phase with Kepler, the third phase with Newton. Complexity science as West defines it, including economics and sociology, remained in the first phase until about the year 2000, when the era of big data began. The era started abruptly when information became cheaper to store in electronic form than to discard. Storing information can be an automatic process, while discarding it usually requires human judgment. The cost of information storage has decreased rapidly while the cost of information discard has decreased slowly. Since 2000, the world has been inundated with big data. In every science as well as in business and government, databases have been storing immense quantities of information. Information now accumulates much faster than our ability to understand it.

Complexity science at the Santa Fe Institute is driven by big data, providing abundant information about ecological and human affairs. Humans can visualize big data most easily when it is presented in the form of scaling laws—hence the main theme of West’s book. But a collection of scaling laws is not a theory. A theory of complexity would give us answers to deeper questions. Why are there ten thousand species of birds on this planet but only five thousand species of mammals? Why are there warm-blooded animals but no warm-blooded plants? Why are human societies so often engaged in deadly quarrels? What is the destiny of our species? These are questions that big data may illuminate but cannot answer. If complexity science ever moves into the third phase, some of these old questions will be answered, and new questions will arise.

West’s first chapter, “The Big Picture,” sets the stage for the detailed discussions that follow, with a section called “Energy, Metabolism, and Entropy,” explaining how one of the basic laws of physics, the second law of thermodynamics, makes life precarious and survival difficult. Entropy is disorder. The second law states that entropy inexorably increases in any closed system. West comments, “Like death, taxes, and the Sword of Damocles, the Second Law of Thermodynamics hangs over all of us and everything around us…. Entropy kills.” His big picture is seriously one-sided. He does not mention the other side of the picture, the paradox of order and disorder—the fact that, in the real worlds of astronomy and biology, ordered structures emerge spontaneously from disorder. The solar system, in which planets move in an orderly fashion around the sun, emerged from a disordered cloud of gas and dust. The fearful symmetry of the tiger and the beauty of the peacock emerge from a dead and disordered planet.

The astronomer Fang Lizhi published with his wife, Li Shuxian, a popular book, Creation of the Universe (1989), which includes the best explanation that I have seen of the paradox of order and disorder.1 The explanation lies in the peculiar behavior of gravity in the physical world. On the balance sheet of energy accounting, gravitational energy is a deficit. When you are close to a massive object, your gravitational energy is minus the amount of energy it would take to get away from the mass all the way to infinity. When you walk up a hill on the earth, your gravitational energy is becoming less negative, but never gets up to zero. Any object whose motions are dominated by gravity will have energy decreasing as temperature increases and energy increasing as temperature decreases.

As a consequence of the second law of thermodynamics, when energy flows from one such object to another, the hot object will grow hotter and the cold object will grow colder. That is why the sun grew hotter and the planets grew cooler as the solar system evolved. In every situation where gravity is dominant, the second law causes local contrasts to increase together with entropy. This is true for astronomical objects like the sun, and also for large terrestrial objects such as thunderstorms and hurricanes. The diversity of astronomical and terrestrial objects, including living creatures, tends to increase with time, in spite of the second law. The evolution of natural ecologies and of human societies is a part of this pattern. West is evidently unaware of Fang and Li’s insight.

The factual substance of West’s book is contained in eighty-one numerical diagrams, displaying a large number of scaling laws obeyed by various observed quantities. The first diagram, concerning the metabolic rate of animals, shows twenty-eight dots, each labeled with the name of a warm-blooded animal species, beginning with mouse and ending with elephant. The dots are displayed on a square graph, the horizontal position of the dot showing the average body mass of the species and the vertical position showing its average rate of consumption of energy. The diagram shows the twenty-eight points lying with amazing accuracy on a single straight line. The slope of the line on the page demonstrates the scaling law relating energy consumption to mass. Energy consumption scales with the three-quarters power of mass. The fourth power of energy consumption scales with the cube of mass. This scaling law holds accurately for mammals and birds. Cold-blooded animals such as fish and reptiles are excluded because they have no fixed body temperature. Their consumption of energy varies with their temperature, and their temperature varies with the weather.

Similar diagrams display similar scaling laws obeyed by other quantities. These laws are generally most accurate for anatomy and physiology of animals, less accurate for social institutions such as cities and companies. Figure 10 shows heart rates of mammals scaling inversely with the one-quarter power of mass. Figure 35 shows the number of patents awarded in the United States scaling with the 1.15 power of the size of the population. Figure 36 shows the number of crimes reported in cities in Japan scaling with the 1.2 power of population. Figure 75 shows that commercial companies in the United States have a constant death rate independent of age—the life expectancy of a company at any age is about ten years. The short lifetime of companies is an essential feature of capitalist economics, with good and bad consequences. The good effect is to get rid of failed enterprises, which in socialist economies are difficult to kill and continue to eat up resources. The bad effect is to remove incentives for foresight and long-range planning.

The closest that West comes to a theory of complexity is his discussion of fractals. A fractal is a structure with big and small branches that look similar at all sizes, like a tree or the blood-vessels of a mammal. When you magnify a picture of a small piece of it, the result looks like the whole thing. The mathematician Benoit Mandelbrot began the study of fractals in the 1960s and called attention to the ubiquity of fractals in nature. Since fractal structure is independent of scale, it leads naturally to scaling laws. West discusses in detail the example of the mammalian blood-vessel system, whose fractal branching evolved to optimize the distribution of nutrients through one-dimensional vessels in three-dimensional tissues. Optimal branching results in the observed scaling law, the total blood flow scaling with the three-quarters power of the mass. Most of the scaling laws in biology can be understood in a similar way as resulting from the fractal structure of tissues.

But this theoretical discussion of fractals is not a theory of complexity. Fractals have the simplest kind of complex structures, with rigid rules of construction. Accurate scaling laws result from simplicity, not from complexity. When West moves from biology to economics and sociology, the fractal structure is less clear and the scaling laws become less accurate. Cities and companies have structures that are only roughly hierarchical and not dictated by theory.

West loves big cities and uses his scaling laws to demonstrate their superiority as habitats for human societies. In a chapter entitled “Prelude to a Science of Cities,” he writes:

The great metropolises of the world facilitate human interaction, creating that indefinable buzz and soul that is the wellspring of its innovation and excitement and a major contributor to its resilience and success.

This lyrical view of modern cities is widely shared, and explains part of the enormous growth of cities. During the present century, billions of people will move from villages to cities, and the population of the planet will become increasingly urban.

West presents in his Figure 45 the scaling law relating the number of telephone conversations in cities to the number of inhabitants. The number of conversations scales with the 1.15 power of the population. The law is exactly the same in the two countries, Britain and Portugal, that maintain the most complete record of telephone calls. West considers telephone conversations to be a good indication of quality of life. More conversations mean more social interaction, more business deals, more exchange of ideas—more opportunities for individuals to push the society forward. His word “buzz” expresses his vision of the great city as the place where human progress happens. He sees the nonlinear scaling law confirming his view that the great city empowers each individual inhabitant to be a more effective innovator.

Granger: Woodcut from Kepler's ‘Mysterium Cosmographicum’, 1596

West does not mention another scaling law that works in the opposite direction. That is the law of genetic drift, mentioned earlier as a crucial factor in the evolution of small populations. If a small population is inbreeding, the rate of drift of the average measure of any human capability scales with the inverse square root of the population. Big fluctuations of the average happen in isolated villages far more often than in cities. On the average, people in villages are not more capable than people in cities. But if ten million people are divided into a thousand genetically isolated villages, there is a good chance that one lucky village will have a population with outstandingly high average capability, and there is a good chance that an inbreeding population with high average capability produces an occasional bunch of geniuses in a short time. The effect of genetic isolation is even stronger if the population of the village is divided by barriers of rank or caste or religion. Social snobbery can be as effective as geography in keeping people from spreading their genes widely.

A substantial fraction of the population of Europe and the Middle East in the time between 1000 BC and 1800 AD lived in genetically isolated villages, so that genetic drift may have been the most important factor making intellectual revolutions possible. Places where intellectual revolutions happened include, among many others, Jerusalem around 800 BC (the invention of monotheistic religion), Athens around 500 BC (the invention of drama and philosophy and the beginnings of science), Venice around 1300 AD (the invention of modern commerce), Florence around 1600 (the invention of modern science), and Manchester around 1750 (the invention of modern industry).

These places were all villages, with populations of a few tens of thousands, divided into tribes and social classes with even smaller populations. In each case, a small starburst of geniuses emerged from a small inbred population within a few centuries, and changed our ways of thinking irreversibly. These eruptions have many historical causes. Cultural and political accidents may provide unusual opportunities for young geniuses to exploit. But the appearance of a starburst must be to some extent a consequence of genetic drift. The examples that I mentioned all belong to Western cultures. No doubt similar starbursts of genius occurred in other cultures, but I am ignorant of the details of their history.

West’s neglect of villages as agents of change raises an important question. How likely is it that significant numbers of humans will choose to remain in genetically isolated communities in centuries to come? We cannot confidently answer this question. The answer depends on unpredictable patterns of economic development, on international politics, and on even more unpredictable human desires. But we can foresee two possible technological developments that would result in permanent genetic isolation of human communities.

One possibility is that groups of parents will be able to give birth to genetically modified children, hoping to give them advantages in the game of life. The children might be healthier or longer-lived or more intellectually gifted than other children, and they might no longer interbreed with natural-born children. The other possibility is that groups of people will emigrate from planet Earth and build societies far away in the depths of space. West considers neither of these possibilities. His view of the future sees humans remaining forever a single species confined to a single planet. If the future resembles the past, humans will be diversifying into many species and spreading out over the universe, as our hominin ancestors diversified and spread over this planet.

So long as we remain on planet Earth, there are strong social, political, and ethical reasons to forbid genetic modification of children by parents. If we are scattered in isolated communities far away, those reasons would no longer be relevant to our experience. A group of humans colonizing a cold and airless world would probably not hesitate to use genetic engineering to adapt their children to the environment. Konstantin Tsiolkovsky, the nineteenth-century prophet of space colonization, already imagined the colonists endowed with green leaves to replace lungs and with moving-picture skin patterns to replace voices. How long will it take for the technologies of space transportation and genetic engineering to bring Tsiolkovsky’s dreams to reality?

Advances in technology are unpredictable, but two hundred years is a reasonable guess for cheap and widely available space travel and genetically modified babies—perhaps one hundred years to develop the science and another hundred years to develop the applications. It is likely that in two hundred years public highways will be carrying passengers and freight around the solar system, with a large enough volume of traffic to make them affordable to ordinary people. At the same time, farmers will be breeding microbes, as well as plants and animals designed to live together in robust artificial ecologies. The option to include humans in the ecology will always be available.

Cheap space travel requires two kinds of public highways, one for escape from high-gravity planets such as Earth, the other for long-distance travel between low-gravity destinations. The high-gravity highway could be a powerful laser beam pointing upward from the ground into space, with spacecraft taking energy from the beam to fly up and down. If the volume of traffic is large enough to keep the beam active, the energy cost per vehicle would be comparable with the energy cost of intercontinental travel by jet planes today. The low-gravity highway could be a system of refueling stations for spacecraft driven by ion-jet engines using sunlight as an energy source. Both the high-gravity and the low-gravity systems are likely to grow within two hundred years if we do not invent something better in the meantime.

Cheap deep-space survival requires genetic engineering of warm-blooded plants. These could grow on the surface of any cold object in the solar system, using energy from the distant sun, water, and other essential nutrients from the frozen soil. A plant would be a living greenhouse, with cold mirrors outside concentrating sunlight onto transparent windows, and roots and shoots inside the greenhouse kept warm by the sunlight. Inside the greenhouse would be a cavity filled with breathable air at a comfortable temperature, serving as a habitat for a diverse ecology of microbes, plants, animals, and humans. The warm-blooded plants could grow the mirrors and the greenhouses and provide nourishment for the whole community. Small objects in the solar system, such as asteroids and comets and satellites, have enough surface area to provide homes for a much larger population than Earth. If ever the solar system becomes overcrowded, life can spread out further, over the galaxy and the universe.2

A chapter in Scale with the title “The Vision of a Grand Unified Theory of Sustainability” gives us West’s view of the future. He sees the rapid growth of big cities and big data causing human activities to scale with time at super-exponential speeds. The acceleration cannot be sustained, since it would lead to a mathematical singularity, with observed quantities becoming infinite within a finite time. The idea of the singularity, an imminent world crisis driven by the explosive growth of artificial intelligence, was promoted by Ray Kurzweil in his book The Singularity Is Near (2005). It is generally regarded as belonging to science fiction rather than to science, but West takes it seriously as a consequence of known scaling laws.

The approaching singularity would force a radical change in the organization of human society, to make our existence sustainable. But the scaling laws would again result in another singularity, forcing another radical change. West foresees a future of repeated approaches to one singularity after another, until the Grand Unified Theory of Sustainability teaches us how to build a truly sustainable society. He leaves the description of the permanent sustainable society to our imagination. The only feature he insists on is the Grand Unified Theory, which will set the rules of human behavior for an endless future. The theory will govern our lives, so that we will be compelled to live within our means.

The last time humans invented a grand unified theory to make our existence sustainable was when Karl Marx came up with dialectical materialism. The theory had great success in changing human behavior over large areas of our planet. But the changes did not prove to be sustainable, and the theory did not remain unified. It seems likely that West’s theory will run into similar difficulties.

The choice of an imagined future is always a matter of taste. West chooses sustainability as the goal and the Grand Unified Theory as the means to achieve it. My taste is the opposite. I see human freedom as the goal and the creativity of small human societies as the means to achieve it. Freedom is the divine spark that causes human children to rebel against grand unified theories imposed by their parents.

Rethinking GPS: Engineering Next-Gen Location at Uber

Location and navigation using global positioning systems (GPS) is deeply embedded in our daily lives, and is particularly crucial to Uber’s services. To orchestrate quick, efficient pickups, our GPS technologies need to know the locations of matched riders and drivers, as well as provide navigation guidance from a driver’s current location to where the rider needs to be picked up, and then, to the rider’s chosen destination. For this process to work seamlessly, the location estimates for riders and drivers need to be as precise as possible.  

Since the (literal!) launch of GPS in 1973, we have advanced our understanding of the world, experienced exponential growth in the computational power available to us, and developed powerful algorithms to model uncertainty from fields like robotics. While our lives have become increasingly dependent on GPS, the fundamentals of how GPS works have not changed that much, which leads to significant performance limitations. In our opinion, it is time to rethink some of the starting assumptions that were true in 1973 regarding where and how we use GPS, as well as the computational power and additional information we can bring to bear to improve it.

While GPS works well under clear skies, its location estimates can be wildly inaccurate (with a margin of error of 50 meters or more) when we need it the most: in densely populated and highly built-up urban areas, where many of our users are located. To overcome this challenge, we developed a software upgrade to GPS for Android which substantially improves location accuracy in urban environments via a client-server architecture that utilizes 3D maps and performs sophisticated probabilistic computations on GPS data available through Android’s GNSS APIs.

In this article, we discuss why GPS can perform poorly in urban environments and outline how we fix it using advanced signal processing algorithms deployed at scale on our server infrastructure.

Figure 1: The above GIF offers a comparison of standard GPS (red) against our improved location estimate (blue) for a pickup from Uber HQ in San Francisco. Our estimated location closely follows the true path taken by the rider, while GPS shows very large excursions.

A bit of background on GPS/GNSS

Before discussing our approach in detail, let us do a quick recap of how GPS works in order to understand why it can be inaccurate in high-rise urban environments.

GPS is a network of more than 30 satellites operated by the U.S. government, orbiting the earth at an altitude of about 20,000 kilometers. (Most cell phones these days can pick up similar Russian “GLONASS” satellites too.)  These satellites send out radio frequency signals that GPS receivers, such as those found in cell phones, can lock onto. Importantly, these satellites advertise the time at which they launch their signals.

For each satellite whose signal the receiver processes, the difference between reception time and launch time (time-of-flight), multiplied by the speed of light, is called the pseudorange. If the satellite's and receiver's clocks were synchronized, and the signal traveled along the straight line-of-sight path, this would equal the actual distance to the satellite. However, the clocks are not synchronized, so the receiver needs to solve for four unknowns: its own 3D coordinates on the globe and its clock bias. Thus, we need a minimum of four satellites (four equations) to solve for these four unknowns.
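
To make this concrete, here is a minimal sketch in C of the pseudorange model the receiver inverts. It is illustrative only, not Uber's code; the struct and function names are invented.

#include <math.h>

#define SPEED_OF_LIGHT 299792458.0 /* meters per second */

/* Known satellite position (ECEF, meters) and its measured pseudorange. */
typedef struct {
    double x, y, z;
    double pseudorange; /* measured time-of-flight multiplied by c, meters */
} Satellite;

/* Residual between the measured and predicted pseudorange for one satellite,
 * given a hypothesized receiver position and clock bias (seconds). A fix
 * drives these residuals toward zero for at least four satellites. */
static double pseudorange_residual(const Satellite *sat,
                                   double rx, double ry, double rz,
                                   double clock_bias_s) {
    double dx = sat->x - rx, dy = sat->y - ry, dz = sat->z - rz;
    double geometric_range = sqrt(dx * dx + dy * dy + dz * dz);
    double predicted = geometric_range + SPEED_OF_LIGHT * clock_bias_s;
    return sat->pseudorange - predicted;
}

In practice the receiver linearizes these residuals around an initial guess and solves the resulting least-squares problem for position and clock bias, which is the sphere-intersection picture of Figure 2 with the clock term added.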

If we ignore clock bias, we can intuitively interpret the location estimate performed by the GPS receiver by intersecting spheres centered at the satellites with the radius of each sphere given by the pseudorange. In practice, a GPS receiver processes signals from a significantly larger number of satellites (up to 20 GPS and GLONASS satellites are visible in an open field), and having more than the minimum number of equations provides extra robustness to noise, blockages, etc. In addition to GPS and GLONASS, some new/future receivers can/will process signals from other satellite systems.  Other navigation satellite systems coming online are Galileo, operated by the European Union, IRNSS in India, and BeiDou, operated by China. The more general term GNSS (global navigation satellite systems) encompasses these systems. (We will use this term in the remainder of the article.)  

Figure 2: In this simplified interpretation of GPS receiver computation, spheres intersect at the center of known satellite locations.

Why GNSS location is inaccurate in urban environments

A major assumption behind GNSS-based positioning is that the receiver has a direct line-of-sight to each satellite whose pseudorange it is computing. This works seamlessly in open terrain but really breaks down in urban environments, as shown in Figure 3, below:

Figure 3: Line-of-sight blockage and strong reflections can cause large GPS errors.

Buildings often block the lines of sight to satellites, so the receiver frequently processes signals corresponding to strong reflections off of other buildings. The significant inaccuracy (positive offsets) in pseudoranges resulting from this phenomenon can lead to errors in position estimates that can be 50 meters or more in urban canyons. Most of us who have wandered,  driven around, or requested an Uber in big cities have experienced these problems first hand.

Satellite signal strengths to the rescue

Our approach to improving location accuracy makes a feature out of the very blockage of GNSS signals that causes trouble for standard receivers.  How? For Android phones, the LocationManager API provides not just the phone’s position estimate, but also the signal-to-noise ratio (SNR) for each GNSS satellite in view. If we put this “signal strength” information together with 3D maps, then we can obtain very valuable location information. Figure 4, below, shows a simplified version of how satellite SNRs and 3D maps can be used to infer which side of the street we are on:

Figure 4: Satellite signal strengths, when combined with 3D maps, provide valuable location information.

Zooming into the details, our approach relies on putting the following intuition in a mathematical framework: if the SNR for a satellite is low, then the line-of-sight path is probably blocked or shadowed; if the SNR is high, then the LOS is probably clear. The qualifier “probably” is crucial here: even when the receiver is in a shadowed area, strong reflected signals can still reach it, and even if it is in a clear area, the received signal can be weak (because of destructive interference between LOS and reflected paths, a phenomenon referred to as multipath fading).  Also, in general, the 3D map is not entirely accurate, and certainly does not capture random blockages by large moving objects not in the map, like trucks. This adds additional uncertainty to the process.  

 

Probabilistic shadow matching using ray tracing

While the intuition on satellite signal strengths carrying useful location information is sound, it must be fleshed out within a probabilistic framework. For any possible location for the receiver, we can check whether the ray joining the location to the satellite is blocked using our 3D map. Now, using a model for the probability distribution of the SNR under LOS and shadowed conditions, we determine the likelihood of the SNR measured for that satellite. For example, if the location is shadowed, then the likelihood of a high SNR is low. The overall likelihood of a given location, based on the satellite SNRs, is the product of the likelihoods corresponding to the different satellites. By doing this over a grid of possible locations, we obtain a likelihood surface—or heat map—of possible receiver locations, based on satellite signal strengths alone. We call this procedure probabilistic shadow matching.
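
The sketch below shows the shape of that computation in C. It is illustrative only: the ray test, the grid of candidate locations, and the SNR likelihood model are stand-ins supplied by the caller, not Uber's APIs.

typedef struct { double azimuth, elevation, snr_db; } SatObservation;

/* Ray test against a 3D building map: returns nonzero if the line of sight
 * from (lat, lon) toward the given azimuth/elevation is blocked. Supplied by
 * the caller in this sketch; the real system queries 3D map tiles. */
typedef int (*RayBlockedFn)(double lat, double lon,
                            double azimuth, double elevation);

/* Toy SNR likelihood: a high SNR is likely when the line of sight is clear,
 * a low SNR when it is shadowed. The dB levels and the Gaussian shape are
 * placeholders for properly fitted models. */
static double log_likelihood_snr(double snr_db, int blocked) {
    double mean = blocked ? 18.0 : 35.0;
    double sigma = 8.0;
    double z = (snr_db - mean) / sigma;
    return -0.5 * z * z; /* log of an unnormalized Gaussian */
}

/* Fill heat[] with the log-likelihood of each candidate grid location by
 * summing per-satellite log-likelihoods (i.e., multiplying likelihoods). */
void shadow_matching_heatmap(const double *lats, const double *lons, int n_cells,
                             const SatObservation *sats, int n_sats,
                             RayBlockedFn blocked, double *heat) {
    for (int c = 0; c < n_cells; c++) {
        double ll = 0.0;
        for (int s = 0; s < n_sats; s++) {
            int is_blocked = blocked(lats[c], lons[c],
                                     sats[s].azimuth, sats[s].elevation);
            ll += log_likelihood_snr(sats[s].snr_db, is_blocked);
        }
        heat[c] = ll;
    }
}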

Figure 5: Ray tracing from one possible location to each satellite for probabilistic shadow matching. This is done for thousands of hypothesized locations.

The likelihood surface, or heat map, from probabilistic shadow matching summarizes the information from satellite SNR measurements. However, as we see from Figure 6 below, this heat map can be pretty complicated. It can have many distinct, widely separated hotspots (local maxima), often corresponding to a given side of the street, but sometimes still in the wrong location (i.e., phantoms). In order to narrow down our location estimate and to avoid locking onto the phantoms, we must now fuse this information with additional sources of information.

Figure 6. A location heat map computed based on satellite signal strengths can have many hotspots. In the above example, our improved location estimate (blue path, black uncertainty ellipse) follows ground truth (yellow path), whereas standard GPS (red path, gray uncertainty ellipse) is inaccurate.

Information fusion via particle filter

For Android phones, the information we use in addition to satellite signal strengths is usually just the standard GNSS position fix, but can also be Android Fused locations, which may include WiFi-based positioning. Since this location can be very inaccurate, single time instant (one-shot) fusion of GNSS fix with shadow matching likelihoods typically leads to poor performance.  In order to take advantage of the information from satellite signal strengths, we trust GPS less in built-up areas (the gray GPS uncertainty ellipse in Figure 6 is a typical model that we use, while the black uncertainty ellipse for improved GPS is an output of our algorithm). We therefore use past measurements and constrain the location evolution over time using a motion model adapted to the application (e.g., pedestrian vs. vehicular motion). This is accomplished by using a particle filter, which approximates the probability distribution of the receiver’s location at any given time by a set of weighted particles. In other words, we estimate where the phone is using thousands of hypothesized locations (i.e., particles).  

Over time, the probability weights and particle locations evolve based on the measurements and the motion model. Since the heat map from probabilistic shadow matching has so many local maxima and because the GNSS fix can have such large outliers, we cannot use standard techniques such as the Kalman filter or the extended Kalman filter, which rely on the tracked probability distribution being well approximated by a bell-shaped Gaussian distribution. The particle filter allows us to approximate arbitrary distributions, at the expense of higher complexity, and this is where our server infrastructure comes in.
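
For concreteness, here is a bare-bones sketch of one particle filter cycle in C. The motion model, the measurement likelihood hook, and all names are our own simplifications, not Uber's service code.

#include <stdlib.h>

typedef struct { double lat, lon, weight; } Particle;

/* Measurement likelihood of a hypothesized location, e.g. the shadow-matching
 * likelihood multiplied by a (heavy-tailed) likelihood around the GNSS fix.
 * Supplied by the caller in this sketch. */
typedef double (*LikelihoodFn)(double lat, double lon);

static double uniform01(void) { return rand() / ((double)RAND_MAX + 1.0); }

/* One predict/update/resample cycle over n particles. `scratch` is a
 * caller-provided buffer of the same size, used during resampling. */
void particle_filter_step(Particle *p, Particle *scratch, int n,
                          double motion_sigma, LikelihoodFn likelihood) {
    /* 1. Predict: crude random-walk motion model (a real system would use
     *    a pedestrian or vehicular model). */
    for (int i = 0; i < n; i++) {
        p[i].lat += motion_sigma * (uniform01() - 0.5);
        p[i].lon += motion_sigma * (uniform01() - 0.5);
    }
    /* 2. Update: reweight by the measurement likelihood and normalize. */
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        p[i].weight *= likelihood(p[i].lat, p[i].lon);
        total += p[i].weight;
    }
    if (total <= 0.0) return; /* degenerate weights: skip resampling */
    for (int i = 0; i < n; i++) p[i].weight /= total;
    /* 3. Resample (systematic): draw n particles in proportion to weight,
     *    so the set concentrates on high-probability hotspots. */
    double step = 1.0 / n, u = uniform01() * step, cum = p[0].weight;
    for (int i = 0, j = 0; i < n; i++) {
        while (u > cum && j < n - 1) cum += p[++j].weight;
        scratch[i] = p[j];
        scratch[i].weight = 1.0 / n;
        u += step;
    }
    for (int i = 0; i < n; i++) p[i] = scratch[i];
}

The reported location is then a weighted centroid of the particles, and the spread of the particle set gives the uncertainty radius shown in Figure 7 below.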

Figure 7: The location estimate obtained as the weighted centroid of the hotspot provided by the particle filter often corrects very large GPS errors. The uncertainty radius (white circle) for improved GPS is based on the spread of the particle set, and is often a more realistic measure than the small uncertainty radius (black circle) typically reported for raw GPS even when the actual position errors are large.

From signal processing to software at scale

The combination of particle filtering and ray tracing introduces complexity to the back-end server ecosystem, making for a very stateful service.

Figure 8: Uber’s GPS improvement system is composed of a particle filter service, 3D map tile management service, a manager service, Uber HTTP API, and cloud storage, and integrates with other Uber services.

There are two kinds of state at play: per-user particle filter state and per-region 3D maps used for ray tracing. The use of particle filters necessitates a level of server affinity. Each new request to our service must be routed to the same back-end server for processing in order to update the correct particle filter. Additionally, due to the large size of 3D maps, each back-end server can only hold a few small sections of the 3D world in RAM.

Since each server can only hold a few square kilometers of map data, not all servers are capable of serving all users. Essentially, implementing the back-end systems for our solution necessitated the creation of a sticky session routing layer that takes server 3D map state into account. In addition to internal tests and performance evaluations, we also run spot checks on our own Android devices using an internal version of the Uber rider app, as illustrated in Figure 9, below:

Figure 9: Red dot/blue dot comparison on our internal version of the rider app allows Uber employees to spot check our solution anywhere in the world.

Moving forward

Accurate estimation of rider and driver location is a crucial requirement for fulfilling Uber's mission of providing transportation as reliable as running water, everywhere, for everyone. To meet our mission, the Sensing, Inference, and Research team is working on a variety of approaches for improving location with creative use of sensors and computation on mobile devices, coupled with the computational power of our server infrastructure. The combination of advanced signal processing, machine learning algorithms, and software at scale has huge potential, and we are always looking for talented and highly motivated individuals (software and algorithms engineers, data visualization engineers, and machine learning engineers) to join us to help realize this potential.

Danny Iland, Andrew Irish, Upamanyu Madhow, & Brian Sandler are members of Uber’s Sensing, Inference and Research team. Danny, Andrew, and Upamanyu were part of the original group that led this research at the University of California, Santa Barbara. After spinning this work into a startup, they demonstrated server-based particle filtering for location improvement in San Francisco using a 3D map constructed from publicly available aerial LiDAR data. They joined Uber in July 2016.

Danny Iland is a senior software engineer on Uber’s Sensing, Inference, and Research team.

Andrew Irish is a senior software engineer on Uber’s Sensing, Inference, and Research team.

Upamanyu Madhow is a researcher at Uber and a professor of Electrical and Computer Engineering at the University of California, Santa Barbara.

Brian Sandler was a summer intern on Uber’s Sensing, Inference, and Research team and is currently a Ph.D student with the University of Pennsylvania.

XS7 – A compact JavaScript engine for embedded applications

Copyright 2017 Moddable Tech, Inc.
Presented May 24, 2017 to Ecma International's TC-39
By Patrick Soquet

Introductory note: This document is the script for a presentation given to the ECMAScript (JavaScript) language committee to introduce XS, the JavaScript engine by Moddable Tech. The talk was given by Patrick Soquet and Peter Hoddie. This document is Patrick's portion of the talk, providing background on the XS engine and technical details on techniques used in the implementation of XS to minimize memory use and code size, while maintaining near full conformance with the evolving JavaScript language specification.

Name and Number

There are clubs, deejays, energy drinks, even a superhero named XS, all in the sense of You Gotta Say Yes to Another Excess. Fifteen years ago, when I began to work on XS, I kind of liked that quote. Maybe because I thought it was bold to develop an ECMAScript engine for consumer devices!

But the name mostly refers to this: XS as in extra small.

The number is of course the edition of the ECMAScript specifications that the implementation claims to be compatible with.

History

At the beginning of this century, Peter and his friends were creating a media oriented software platform for consumer devices. I joined them to add a scripting engine. My experience was a language named Key that I had designed and developed for building CD-ROMs. Maybe somebody still remembers CD-ROMs? But Peter wanted a standard language, so we selected ECMAScript.

Some characteristics of my original language remain in XS; for instance, the way XS communicates with its debugger, xsbug.

XS3

The first version of XS was roughly based on the third edition of the specifications. I said roughly not because the implementation was incomplete, but because XS3 extended the language with XML in order to describe prototype hierarchies, property attributes, framework packages, etc.

XS progressively abandoned such extensions as equivalent features entered the standard. For instance, with the fifth edition, XS used the property descriptors and the related features of Object to create and define prototypes and properties; and, of course, with the sixth edition, classes and modules to package frameworks.

Back to the early days. Our main customer then was Sony and XS3 found its way into several of their consumer devices. One of the most interesting is the e-book reader PRS-500:

The XS tool chain was already in place to deliver the e-book reader system: a byte code compiler and packager, a graphical debugger and, last but not least, a simulator. Indeed XS is always available on macOS and Windows for the purpose of prototyping frameworks and applications.

XS5

In 2010 and 2011, I implemented the fifth edition of the specifications. At that time our software platform was mostly used to build mobile apps on iOS and Android. Our frameworks were implemented in C and interfaced in ECMAScript. All applications were scripts that XS5 compiled and packaged to run on phones and tablets.

The XS tool chain allowed us to release self-contained applications without their sources. And our frameworks delivered performance almost indistinguishable from completely native applications.

XS6

In 2014 and 2015, I implemented the sixth edition of the specifications. We had been acquired by a semiconductor company and our focus was on one hand to build frameworks and applications for their customers, and on the other hand to promote their chips thru products for "makers".

So XS6 has been used privately in several appliances and peripherals, and publicly in two construction kits, Kinoma Create and Kinoma Element, which was our first micro-controller based device.

XS6 was also the first open source release of XS. That was necessary because customers would never use our software platform without being able to access its sources. And for me, that was significant: I learned a lot from the sources of other ECMAScript engines and I was happy for XS to become eventually available to everybody.

XS7

Peter already presented where we are now so I can skip the rest...

I told you that story to give you some context:

  • It is obvious today with micro-controllers, but the target of XS has almost always been devices with limited memory and performance.
  • Out of curiosity, I explored browser based frameworks and applications, node.js, etc. However, XS does not come from the web, neither from the client side nor from the server side.
  • I have been helped a lot to debug, port and optimize XS. And of course XS would be different without the suggestions of my colleagues. But put the blame on me if the way XS implements ECMAScript is terrible!

Conformance

Before I explain the foundations of XS and the runtime model we use for embedded development, I want to mention that XS supports the traditional runtime model.

On Linux, macOS and Windows, XS can be built into a command line tool that executes modules, scripts in strict mode, and scripts in non-strict mode. That is how XS goes thru the Test262 conformance suite.

So the number 7 means that XS now conforms to the 7th edition of the specifications. Currently XS7 passes 99% of the language tests (18402/18544) and 97% of the built-ins tests (24143/24804).

Note: The above data is from the date of the talk. The most recent conformance data is here.

As expected there are still bugs! But I am pleased with these results, especially since I understand why such or such test fails... Failures are explained in a document distributed with XS. For instance:

  • XS depends too naively on PCRE for parsing and executing regular expressions, and on the C standard library for computing dates.
  • XS does not store the source code of functions so Function.prototype.toString fails.
  • Strings are UTF-8 encoded C strings; their length is the number of Unicode characters they contain, and they are indexed by Unicode characters.
  • XS does not implement yet JSON revivers and replacers.

In these results, tests related to proposals beyond the 7th edition are skipped. Such tests are helping me to prepare XS8...

Foundations

XS has existed for a while, and its foundations are still relevant today on micro-controllers. Let me introduce a few of them, which will also define the vocabulary I use when I talk about XS.

Machine

The main XS runtime structure is the machine. Each machine is its own ECMAScript realm and has its own heap, stack, keys, names table, symbols table, connection to the debugger, etc.

XS can run several machines concurrently in several threads but one machine can only run in one thread at a time. XS7 provides a C programming interface for machines to communicate.

Slot

The most ubiquitous XS runtime structure is the slot. Slots store booleans, numbers, references, etc.

The size of a slot is four times the size of a pointer, so 16 bytes on 32-bit processors.

  • The first field of a slot is a pointer to the next slot. Slots are mostly used as linked lists. For instance objects are linked lists of properties.
  • The second field of a slot is its id, an index into the keys of the machine that owns the slot. For instance properties and variables have an id.
  • The third field of a slot is its flag, with for instance bits for configurable, enumerable and writable. The garbage collector also uses the flag to mark the slot.
  • The fourth field of a slot is its kind. It defines what is in the fifth field, the value of the slot. For instance if the kind is number, the value contains a double.

In the machine heap, there is a linked list of free slots, using the next field of the slots. The slot allocator removes slots from that list. The garbage collector sweeps unreachable slots by adding them to that list.
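
In C, the layout described above looks roughly like this; the field names and union members are illustrative, not the actual XS declarations.

/* A rough sketch of the slot layout described above. */
typedef struct Slot Slot;
struct Slot {
    Slot *next;            /* linked list: next property, next free slot, ... */
    short id;              /* index into the machine's keys */
    unsigned char flag;    /* configurable / enumerable / writable + GC mark */
    unsigned char kind;    /* says what the value field holds */
    union {
        double number;     /* kind: number */
        long integer;      /* kind: integer */
        Slot *reference;   /* kind: reference, points to an object's first property */
        char *string;      /* kind: string, points into a chunk */
        struct { void *data; void *hook; } host; /* widest members hold two pointers */
    } value;
};
/* With the two-pointer union members, the whole slot is four pointers wide:
 * 16 bytes on a 32-bit processor, as noted above. */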

Chunk

What does not fit into a slot, or a list of slots, goes into a chunk. Chunks store byte codes, strings, typed arrays, etc. Chunks are always accessed thru slots. The kind field of a slot that points to the chunk defines what the chunk contains.

The size of chunks varies.

The first four bytes of a chunk hold the size of its data. The garbage collector uses the high-order bit of the size to mark the chunk.
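
A small sketch of that header trick (illustrative, not the XS source):

#include <stdint.h>

#define CHUNK_MARK 0x80000000u

/* The size word doubles as the garbage collector's mark, using its
 * high-order bit, so marking a chunk costs no extra storage. */
typedef struct {
    uint32_t size; /* data size in bytes; high bit is the mark */
    /* data bytes follow */
} ChunkHeader;

static void     chunk_mark(ChunkHeader *c)            { c->size |= CHUNK_MARK; }
static void     chunk_unmark(ChunkHeader *c)          { c->size &= ~CHUNK_MARK; }
static int      chunk_is_marked(const ChunkHeader *c) { return (c->size & CHUNK_MARK) != 0; }
static uint32_t chunk_size(const ChunkHeader *c)      { return c->size & ~CHUNK_MARK; }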

In the machine heap, chunks are allocated inside a memory block. In order to compact the memory block, the garbage collector sweeps unreachable chunks by relocating the reachable chunks, then updates the slots that point to them with their relocated addresses.

Fragmentation is a real issue on micro-controllers with limited memory and no memory management unit to cope, so relocatable chunks are a blessing.

XS allows frameworks and applications to use chunks for their own data, thanks to a specific kind of slot that references a chunk and hooks for the garbage collector. Such a slot works like a Handle, for those who remember the original Macintosh operating system.

Byte Code

At the core of the XS runtime, there is a huge C function that is a byte code interpreter. The C function uses computed gotos to dispatch byte code, which is always faster than a loop and a switch, especially because of branch prediction. For details, see for instance Eli Bendersky, "Computed goto for efficient dispatch tables".
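
A stripped-down illustration of the technique (made-up byte codes, not the actual XS ones; computed gotos are a GCC/Clang extension):

enum { OP_PUSH_INT, OP_ADD, OP_HALT };

/* Interpret a tiny byte code program and return the top of the stack.
 * Each handler ends by jumping directly to the next instruction's label,
 * giving the CPU one indirect branch per handler to predict instead of a
 * single shared switch branch. */
long run(const unsigned char *pc) {
    static void *dispatch[] = { &&op_push_int, &&op_add, &&op_halt };
    long stack[64];
    int sp = 0;

#define NEXT() goto *dispatch[*pc++]

    NEXT();
op_push_int:
    stack[sp++] = *pc++; /* operand: a one-byte immediate */
    NEXT();
op_add:
    sp--;
    stack[sp - 1] += stack[sp];
    NEXT();
op_halt:
    return stack[sp - 1];
#undef NEXT
}

Running it on the byte codes { OP_PUSH_INT, 2, OP_PUSH_INT, 3, OP_ADD, OP_HALT } returns 5.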

The primary objective of byte code is of course to encode modules and scripts into something that is fast enough to decode. XS compiles modules and scripts into byte codes to handle their relative complexity once and only once. Then XS only runs byte code. Byte code is the only information exchanged between compile and run time.

The secondary objective of byte code is compression. I always try to find a balance between the vocabulary (the number of byte codes available to encode ECMAScript constructs) and the grammar (the number of byte codes necessary to encode ECMAScript constructs). The design of the byte code evolves with each version of XS.

Currently there are 194 different byte codes. Most language constructs generate few byte codes, except the explicit or implicit usage of iterators, which I need to factorize otherwise.

In fact most byte codes do not look at all like assembly instructions. Most byte codes directly refer to ECMAScript expressions and statements. For instance there are byte codes to get and set properties, byte codes for unary and binary operators, byte codes for primitive values, etc.

In that sense, XS is an ECMAScript virtual machine, like an imaginary ECMAScript processor.

A lot of byte codes have no values and take just 1 byte. Most byte codes with values have variations depending on the size of their values. A few examples:

  • the integer byte code has variations for 1-, 2-, and 4-byte values,
  • the branch byte codes also have variations for 1-, 2-, and 4-byte values, depending on how far the interpreter has to jump,
  • since ECMAScript functions with more than 255 arguments and variables are rare, the related byte codes have variations too, so most accesses and assignments take only two bytes.

Such variations may seem like a detail but every byte matters!

Stack

The byte code interpreter is stack based. Byte codes push, pop or transform references and values to, from or on the machine stack. The machine stack is a stack of slots, separate from the C stack.

When calling an ECMAScript function, XS does not use the C stack. The byte code interpreter uses the machine stack to remember where to go back and jumps to the first byte code of the ECMAScript function.

That technique allows XS to run intricate enough ECMAScript code despite the typically tiny C stack of micro-controllers.

That technique has other benefits. For instance tail call optimization is straightforward.

And generators only have to store part of the machine stack in order to yield.

However XS implements built-in functions in C and modules can implement their functions in C too. So the byte code interpreter is re-entrant, in order to handle calls from and to such host functions.

Scopes

At compile time, XS parses modules and scripts into syntax trees, then hoists definitions and variables, then scopes identifiers, then generates byte code.

The first objective of scoping identifiers is to access and to assign variables by index, on the stack, directly or indirectly in closures, instead of having to lookup identifiers. That is for performance.

The second objective of scoping identifiers is to create closures with only the necessary variables, since enclosing functions know what their enclosed functions use. That is for memory.

In strict mode, when there is a direct call to eval, XS creates closures for everything. For the evaluated string, XS generates different byte codes to lookup identifiers.

In non-strict mode, when there is a direct call to eval or a with statement, it is similar, except that XS also generates different byte codes to lookup identifiers outside the direct call to eval or inside the with statement.

XS in C

XS provides two C programming interfaces: one for platforms, one for applications.

Platform Programming Interface

XS is written in C, not C++, and primarily relies on constants and functions from the C standard library. Such constants and functions are accessed thru macros so platforms can override them when necessary.

Then a platform has a few functions to implement:

  • to allocate the memory blocks used by slots and chunks,
  • to connect to, communicate with and disconnect from the debugger,
  • to find modules,
  • to defer the execution of promise jobs.

It seems simple enough because it is. That has helped XS to be ported to plenty of hardware and software platforms.

However, some peculiar platforms, like the Xtensa instruction set and architecture, to which XS was ported by Peter, still require some significant work.

Platforms can extend the machine structure itself. Since the machine is passed to all XS functions, it is a convenient way for platforms to have their own context in addition to the application context.

Application Programming Interface

XS has an extensive C programming interface for applications to create and delete machines, to execute modules and scripts, to store and retrieve contexts, to collect garbage, etc.

Most importantly, XS allows applications to implement ECMAScript functions in C. That is how frameworks are built: as collections of modules that extend the host for specific tasks.

With XS, there is no such thing as a generic host. Applications describe the modules they need in their manifest, and XS provides tools to assemble a specific host.

Modules tell XS that a given function is implemented in C with an intentionally "alien" syntax:

export default class Out {
    print(it) @ "Out_print";
    printLine(it) {
        this.print(it + "\n");
    }
}

The syntax allows modules to be hybrid, implemented partially in C and partially in ECMAScript, and to use arbitrary C identifiers for functions. It works for functions, methods, getters and setters. The parameter list must be simple.

Based on that, when assembling a host, the XS tools generate what is necessary to automatically bind C functions to Function objects.

The C implementation itself uses macros to convert from and to ECMAScript values, to access arguments and variables, to get and set properties, to call functions, etc.

void Out_print(xsMachine* the) {
    printf("%s", xsToString(xsArg(0)));
}

For the sake of performance, the API uses the machine stack almost like the byte code interpreter does. The macros wrap the API to simplify its usage and to check boundaries.
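
For completeness, here is how the hybrid Out module shown above might be used from ECMAScript. The module specifier "out" is an assumption of this sketch; the actual specifier comes from the application's manifest.

import Out from "out";    // specifier assumed for this sketch

const out = new Out;
out.print("Hello");          // runs Out_print, implemented in C
out.printLine(", world!");   // runs the ECMAScript method, which calls print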

xsbug

A source-level debugger is of course essential for any language runtime. XS has always provided xsbug, a stand-alone debugger, on macOS and Windows. I built the latest version of xsbug using XS7 itself and a thin user interface framework.

xsbug is a TCP/IP server, waiting for connections from XS machines. xsbug can debug several machines in parallel, each machine being a tab in its window. XS machines can be TCP/IP clients directly, or indirectly through a serial connection to the PC and a serial-to-TCP/IP bridge.

The protocol between xsbug and machines is an XML conversation, somewhat similar to the original protocol of XMPP.

xsbug sends messages to break, run, step, and to set and clear breakpoints:

<set-breakpoint path="/Users/ps/Projects/moddable/examples/piu/balls/main.js" line="25"/><go/>

Machines send messages to log, report exceptions, and let xsbug inspect the heap and the stack:

<xsbug><log>Hello, xsbug!</log></xsbug>

Machines also instrument themselves, measuring heap and stack size, counting garbage collections, keys used, modules loaded, and so on. Machines periodically send samples to xsbug:

<xsbug><samples>0,0,0,0,0,0,0,0,0,1304,2032,2064,8192,0,4096,0,0,12</samples></xsbug>

There is a C programming interface for the instrumentation, so modules can also instrument themselves. The available instruments are reported at the beginning of the conversation:

<xsbug><instruments><instrument name="Pixels drawn" value=" pixels"/><instrument name="Frames drawn" value=" frames"/><instrument name="Network bytes read" value=" bytes"/><instrument name="Network bytes written" value=" bytes"/><instrument name="Network sockets" value=" sockets"/><instrument name="Timers" value=" timers"/><instrument name="Files" value=" files"/><instrument name="Poco display list used" value=" bytes"/><instrument name="Piu command List used" value=" bytes"/><instrument name="Chunk used" value=" / "/><instrument name="Chunk available" value=" bytes"/><instrument name="Slot used" value=" / "/><instrument name="Slot available" value=" bytes"/><instrument name="Stack used" value=" / "/><instrument name="Stack available" value=" bytes"/><instrument name="Garbage collections" value=" times"/><instrument name="Keys used" value=" keys"/><instrument name="Modules loaded" value=" modules"/></instruments></xsbug>

xsbug uses the instruments and samples to display the evolution of the machine while it is running or at every step.

Embedded Development

The challenges of embedded development are obvious: limited memory and limited performance. Compared to the hardware that usually runs ECMAScript on the client or server side, the differences are measured in orders of magnitude, not percentages: kilobytes instead of gigabytes, megahertz instead of gigahertz, a single core instead of multiple cores...

While RAM is often extremely limited, embedded devices also have ROM that can be flashed from a PC. Not a lot of ROM, but perhaps a megabyte or two! The main strategy XS uses to run ECMAScript on embedded devices is to run as much as possible from ROM instead of RAM. XS then applies several techniques to further reduce the amount of ROM it requires.

Compile

XS always compiles modules or scripts into byte code. The part of XS that parses, scopes and codes modules or scripts and the part of XS that executes byte code are completely separate.

In the traditional runtime model, the two parts are of course chained. For embedded development, the XS compiler runs on a PC and the resulting byte code is flashed into ROM where it is executed.

In fact, most embedded devices cannot afford to parse, scope and code modules or scripts themselves, so that part of XS is often absent, which saves some ROM at the cost of eval, new Function and the GeneratorFunction constructor.

Link

Having the byte code in ROM is not enough. There are the ECMAScript built-ins, and there are the modules that applications need to do something useful: a user interface framework, a secure network framework, and so on. That means a lot of memory just to define objects.

In the traditional runtime model, when the application starts, built-ins are constructed, then modules are loaded. The resulting objects are created dynamically in RAM. Even with the byte code in ROM, that requires too much RAM on embedded devices.

So XS allows developers to prepare the environment to run their application. On a PC, the XS linker constructs built-ins and loads modules as usual, then saves the resulting objects as C constants. Together with the C code of XS itself, of the built-ins and of the frameworks, such C constants are built into the ROM image.

Run

When the application starts, everything is ready, so the embedded device boots instantaneously. Nothing is ever copied from ROM to RAM, so the application runs in a few kilobytes.

Let us explore some interesting details...

The ROM machine

The XS linker creates a machine to construct built-ins and load modules. If module bodies call no host functions, they can even be executed to construct the module objects. Applications declare the modules to prepare that way in their manifest.
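
As a rough sketch of such a declaration (the module names are invented, and the exact keys depend on the version of the XS tools), a manifest can list the modules to include and the subset to prepare:

{
    "modules": {
        "*": [
            "./main",
            "./sensor"
        ]
    },
    "preload": [
        "sensor"
    ]
}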

The heap, stack, keys, names, symbols and module table of that machine are literally dumped into C constants, together with the byte code and references to host functions. Here are some byte codes of a module, and some heap slots that define an ECMAScript function using that byte code.

static const txU1 gxCode1[1023] = {
    0x45, 0x81, 0x9a, 0x66, 0x00, 0x16, 0x48, 0x81, 0x9a, 0x28, 0x03, 0xc1, 0x0a, 0x00, 0x83, 0x08,
    //...
    0x03, 0xb3, 0xad, 0x81, 0x91, 0x76, 0x76, 0x5c, 0x03, 0xb3, 0x5c, 0x06, 0x69, 0x89, 0x39
};

static const txSlot gxHeap[mxHeapCount] = {
// ...
/* 1804 */  { (txSlot*)&gxHeap[1805], XS_NO_ID, 0x92, XS_INSTANCE_KIND, { .instance = { NULL, (txSlot*)&gxHeap[756] } } },
/* 1805 */  { (txSlot*)&gxHeap[1806], -32346, 0x8f, XS_CODE_X_KIND, { .code = { (txByte*)(&gxCode1[44]), NULL } } },
/* 1806 */  { (txSlot*)&gxHeap[1807], XS_NO_ID, 0x8f, XS_HOME_KIND, { .home = { (txSlot*)&gxHeap[1781], (txSlot*)&gxHeap[1778] } }},
/* 1807 */  { NULL, -32508, 0x8e, XS_REFERENCE_KIND, { .reference = (txSlot*)&gxHeap[1781] } },
// ...
};

The RAM machine

The XS runtime creates a machine to run the application. Except for a few tables that are inherently live, the dynamic RAM machine is initialized by pointing to the static ROM machine.

Here are a few examples:

  • When the application invokes built-in or prepared module constructors, the constructors are in ROM and the prototypes of the results are also in ROM.
  • When functions are created on the fly, which is increasingly common with arrow functions, the Function objects are in RAM but their byte code stays in ROM.
  • String constants and literals are also created from ROM instead of RAM: the string slots point to C constants instead of chunks.

So the heap can be extra small, which helps the simple garbage collector of XS to be efficient, since a big "generation" never has to be marked and swept.

Aliases

What happens when applications want to modify objects that are in the ROM machine?

The RAM machine has a table of aliases, which is initially empty. All aliasable objects in ROM have an index into that table. When an application modifies an aliasable object, an instance is inserted into the table to override it.

Such a mechanism has a cost in memory and performance, so prepared modules can use Object.freeze in their body to tell the XS linker that a given object does not need to be aliasable.

In fact, thanks to subclassing, it is practical and healthy to freeze most functions and prototypes.
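
For instance, here is a minimal sketch of a prepared module (the class itself is invented for illustration) that freezes what it exports, so the XS linker can keep the constructor and its prototype in ROM without alias table entries:

export default class Clock {
    constructor() {
        this.started = Date.now();    // instances are created in RAM at run time
    }
    get elapsed() {
        return Date.now() - this.started;
    }
}
// Executed in the module body, on the PC, while the linker prepares the module:
Object.freeze(Clock);             // the constructor does not need to be aliasable
Object.freeze(Clock.prototype);   // neither does its prototype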

Future

Obviously there will be XS8 :-) Two proposals are especially relevant to our target.

Async functions and await expressions would help applications cleanly define the typical microcontroller setup and loop: configuring sensors, reading sensors, uploading results, and so on. That work is essentially asynchronous.
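
A sketch of the kind of code this would enable; the sensor and upload functions are hypothetical placeholders defined here only so the example is self-contained, not part of XS or of any framework.

// Hypothetical placeholders for the sketch.
const configureSensor = async () => ({ read: async () => 23 });
const upload = async (sample) => void sample;

async function main() {
    const sensor = await configureSensor();   // setup
    for (;;) {
        const sample = await sensor.read();   // read a measurement
        await upload(sample);                 // upload the result
    }
}
main();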

Microcontrollers with multiple cores are here, and XS could create several tiny RAM machines based on one huge ROM machine and run them in parallel. So atomics and shared memory could become relevant, even on embedded devices.
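
For reference, this is the kind of standard ECMAScript that the shared memory and atomics proposal enables; it is not an XS feature yet, and how machines would share the buffer is left open here.

// One SharedArrayBuffer visible to two machines or agents.
const shared = new SharedArrayBuffer(4);
const counter = new Int32Array(shared);

// Producer side: count an event atomically.
Atomics.add(counter, 0, 1);

// Consumer side: read the count without tearing.
const count = Atomics.load(counter, 0);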

On my side, I will continue to refactor the code of XS. Implementing Proxy and Reflect a few years ago already drove me in the right direction, but there is of course much left to do.

I am also still exploring the possibilities offered by the linking process just described. There could be some significant optimizations to make there, for instance by further modifying the byte code at that stage.

Thank you!


The first image on the very first Macintosh

Burrell Smith liked to do intensive design work over the Christmas break, so the very first prototype of the very first Macintosh sprung to life early in the first month of the new decade, in January 1980. It wasn't really a stand-alone computer yet, as the prototype resided on an Apple II peripheral card, but it already contained the essential hardware elements of Jef Raskin's Macintosh dream: a Motorola 6809E microprocessor, 64K of memory, and a 256 by 256 bit-mapped graphic frame buffer, which was hooked up to a cute, tiny 7 inch black and white display. Burrell used the Apple II host to poke values into the memory of the prototype, so he could initialize the control registers and run small programs with the 6809.

I went out to lunch with Burrell a few weeks later and, knowing my appreciation for Woz-like hardware hacks, he explained the crazy way that he contrived for the Apple II to talk with the prototype. He didn't want to waste time designing and wiring up hardware to synchronize the memory of the two machines, since that wouldn't be needed by the real product. Instead, he delegated the memory synchronization to the software, requiring the Apple II to hit a special memory address to tell the prototype how many microseconds later to grab data off of the common data bus. It was weird enough to make me interested in seeing if it really worked.

By now, Burrell thought that he had the graphics running properly, but he wasn't really sure; he still needed to write some software to try it out. I told him that I'd look into it when I had some time. He gave me a copy of a handwritten page that contained the magic addresses that I'd have to use, hoping that I'd get around to it soon.

I was used to coming back to the lab at Apple after dinner, to see if anything interesting was going on and working on various extra-curricular projects. I had some spare time that night, so I got out Burrell's instructions and wrote an Apple II (6502) assembly language routine to do the necessary bit-twiddling to transfer whatever was on the Apple II's hi-res graphic display to the Mac prototype's frame-buffer, using Burrell's unusual synchronization scheme.

One of my recent side projects involved using Woz's new, one-to-one interleave floppy disk routines to make very fast slideshow disks on the Apple II. I had just made one full of Disney cartoon characters that were scanned by Bob Bishop, one of the early Apple software magicians. Bob adored the work of Carl Barks, the Disney artist who specialized in Donald Duck, and he had scanned dozens of Barks' Donald Duck images for the Apple II. I selected an image of Scrooge McDuck sitting on top of a huge pile of money bags, blithely playing his fiddle, with a big grin on his beak. I'm not sure why I picked that one, but it seemed to be appropriate for some reason.

Even though it was starting to get late, I was dying to see if my routine was working properly, and it would be very cool to surprise Burrell when he came in the next day with a detailed image on the prototype display. But when I went to try it, I noticed that Burrell's Apple didn't have a disk controller card, so there was no way to load my program. Damn! I couldn't shut the computer down to insert the card, because I didn't know how to reinitialize the Macintosh board after power-up; Burrell hadn't left the magic incantation for that. I thought I was stuck, and would have to wait until Burrell came in tomorrow.

The only other person in the lab that evening was Cliff Huston, who saw the trouble I was having. Cliff was another early Apple employee, the older brother of Dick Huston (the heroic programmer who wrote the 256-byte Apple II floppy disc boot ROM) and an experienced, somewhat cynical technician. I explained the situation to him and was surprised when he started to smile.

Cliff told me that he could insert a disk controller card into Burrell's Apple II with the power still on, without glitching it out, a feat that I thought was miraculous: you'd have to be incredibly quick and steady not to short-circuit any of the contacts while you were inserting it, running the risk of burning out both the Apple II and the card. But Cliff said he'd done it many times before: all that was required was the confidence that you could actually do it. So I crossed my fingers as he approached Burrell's Apple like a samurai warrior, concentrating for a few seconds before holding his breath and slamming the disk card into the slot with a quick, staccato thrust.

I could barely make myself look, but amazingly enough Burrell's machine was still running, and the disk booted up so I could load the Scrooge McDuck image and my new conversion routine. And even more surprising, my routine actually worked the first time, displaying a crisp rendition of Uncle Scrooge fiddling away on the Mac's tiny monitor. The Apple II only had 192 scan-lines, while the embryonic Macintosh had 256, so I had some extra room at the bottom where I rendered the message "Hi Burrell!" in a nice-looking twenty-four point, proportional font.

By the time I came in the next morning, an excited Burrell had already shown the image to everyone he could find, but then he accidentally reset the prototype somehow, and didn't know how to get the image back on the screen. I loaded it again so he could show it to Tom Whitney, the engineering VP. I think Jef was pretty pleased to see his new computer start to come alive, but I don't think he was very happy about me giving the demo, since he thought I was too much of a hacker, and I wasn't supposed to be involved with his pet project.
