Saturday, March 5, 2016

LinkedIn open-sources WhereHows, a metadata management tool




LinkedIn today announced that it’s open-sourcing a piece of its software called WhereHows, which allows anyone in a company to learn about and share information on the data that the company manages. The software is now available on GitHub under an open source Apache license.


LinkedIn has many systems for storing and processing data, including Teradata’s data warehousing technology, the open source Hadoop Distributed File System (HDFS), the open source Hive data warehousing software, and its own open source Pinot real-time analytics software. With that many systems, it’s not trivial to know exactly where a given kind of data lives. WhereHows can help with that, because it lets people run wide-ranging searches across all of these systems, and people can post what they know about the data.


Rather than showing the data itself, WhereHows lets people track which types of data are available. In other words, it’s a tool for discovering and managing metadata. WhereHows is available to people at LinkedIn in the form of a user interface and an application programming interface (API) for developers. It serves up information on more than 25,000 publicly shared data sets from HDFS alone. It also tracks flows of data through multiple tools; for example, it surfaces 150,000 flows from LinkedIn’s open source job scheduler. Rather than keeping the software to itself, LinkedIn is opening it up for other companies with complex systems to use and even build on.
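To make the distinction between data and metadata concrete, here is a minimal Python sketch of the general idea of a metadata catalog: it records where data sets live and what people have said about them, and it searches across systems without touching the data itself. This is not WhereHows code or its API. Every name in it (DatasetRecord, MetadataCatalog, the sample data sets and their fields) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Metadata about one data set; the data itself is never stored here."""
    name: str
    system: str                      # e.g. "hdfs", "hive", "teradata"
    fields: list[str]                # column or field names
    comments: list[str] = field(default_factory=list)


class MetadataCatalog:
    """Hypothetical in-memory catalog, keyed by (system, name)."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, str], DatasetRecord] = {}

    def register(self, record: DatasetRecord) -> None:
        self._records[(record.system, record.name)] = record

    def annotate(self, system: str, name: str, comment: str) -> None:
        # Lets a person share what they know about a data set.
        self._records[(system, name)].comments.append(comment)

    def search(self, term: str) -> list[DatasetRecord]:
        # Case-insensitive substring match over names, fields, and comments,
        # across every registered system.
        needle = term.lower()
        hits = []
        for record in self._records.values():
            haystack = [record.name, *record.fields, *record.comments]
            if any(needle in text.lower() for text in haystack):
                hits.append(record)
        return hits


# Made-up sample entries spread over three storage systems.
catalog = MetadataCatalog()
catalog.register(DatasetRecord("member_profiles", "hdfs", ["member_id", "headline"]))
catalog.register(DatasetRecord("page_views", "hive", ["member_id", "url"]))
catalog.register(DatasetRecord("ad_clicks", "teradata", ["campaign_id"]))
catalog.annotate("teradata", "ad_clicks", "Joined to member_id upstream")

# One search finds matches in all three systems, including via a comment.
systems = sorted(record.system for record in catalog.search("member_id"))
print(systems)
```

In this toy version, the search for "member_id" turns up the Teradata entry only because someone posted a comment about it, which is the kind of shared knowledge the article describes.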


“We are open sourcing WhereHows on GitHub, as well as our discussion group, to share our work with the broader data community,” LinkedIn staff data engineer Eric Sun wrote in a blog post. “We highly encourage contributors from different companies to create new features and commit important bug fixes. Though metadata management tends to be tightly coupled to other components in the company, we will continue to try to refactor LinkedIn-internal integrations into WhereHows into generic templates or plugins in open source.”


This is hardly LinkedIn’s first open source contribution. Pinot became available last year, and before that, there were Azkaban, Kafka, Samza, and Voldemort.


But data discovery, or the data catalog, is a whole other type of software. Many proprietary tools are available in this category. For instance, startup Tamr came out with something last year. So an open source release like WhereHows could be a big deal for companies with complex data infrastructures. In return, LinkedIn could easily find people willing to improve the technology and maybe even join the company’s ranks.


LinkedIn wants to enhance the software by integrating it with tools like Kafka, Samza, Gobblin, and Nuage, and it could also add information on joins between different types of data, Sun wrote.


Documentation for all parts of WhereHows is here.

