A Data Professionals Community

How Uber Uses Spark and Hadoop to Optimize Customer Experience

If you’ve ever used Uber, you’re aware of how ridiculously simple the process is.


You press a button, a car shows up, you go for a ride, and you press another button to pay the driver. But there’s a lot more going on behind the scenes, and much of that infrastructure increasingly runs on Hadoop and Spark, as the Uber data team recently shared.

Uber has the enviable position of sitting at the junction of the digital and physical worlds. It commands an army of more than 100,000 drivers who are tasked with moving people and their stuff within a city or a town. That’s a relatively simple problem. But as Uber’s Head of Data Aaron Schildkrout recently said, the simplicity of its business plan gives Uber a huge opportunity to use data to essentially perfect its processes.

“It’s fundamentally a data problem,” Schildkrout says in a recording of a talk that Uber recently gave with Databricks. “Because it is so simple, we sort of get to the essence of what it means to automate an experience like this. In a sense we’re trying to bring intelligence, in an automated and basically real time way, to cars that all over the globe right now that are carrying people around, to make that happen at this tremendous scale.”

Whether it’s calculating Uber’s “surge pricing,” helping drivers to avoid accidents, or finding the optimal positioning of cars to maximize profits, data is central to what Uber does. “All these data problems…are really crystalized on this one math with people all over the world trying to get where they want to go,” he says. “That’s made data extremely exciting here, it’s made engaging with Spark extremely exciting.”

Big Data at Uber

In the Databricks talk, Uber engineers described (apparently for the first time in public) some of the challenges the company has faced in getting its back-office applications to scale, and what it’s done to meet that demand.

Spark has been “instrumental in where we’ve gotten to,” says Vinoth Chandar, who’s in charge of building and scaling Uber’s data systems. Under the old system, Uber relied on Kafka data feeds to bulk-load log data into Amazon S3, and used EMR to process that data. It then moved the “gold-plated” output from EMR into its relational warehouse, which is accessed by internal users and the city-level directors leading Uber’s expansion around the world…

