
A Scalable Approach to Raw Data Ingestion

Naturally, the first step in an enterprise data pipeline involves the source systems and raw data that will ultimately be ingested, blended, and analyzed.


Market experience suggests that the most important big data insights tend to come from combining diverse data that may initially be isolated in silos across the organization.

As such, a key need in Hadoop data and analytics projects is the ability to tap into a variety of data sources, types, and formats. Further, organizations need to prepare not only for the data they want to integrate with Hadoop today, but also for the data that additional use cases may require in the future.

The following types of data sources and formats are often part of Hadoop analytics projects:

  • Data warehouses and RDBMSs containing transactional customer profile data
  • Log file and event data including web logs, application logs, and more
  • Data in semi-structured formats including XML, JSON, and Avro
  • Flat files, such as those in CSV format
  • Data housed in NoSQL data stores, such as HBase, MongoDB, and Cassandra
  • Data pulled from web-based APIs as well as FTP servers
  • Cloud and on-premises application data, such as CRM and ERP data
  • Analytic databases such as HPE Vertica, Amazon Redshift, and SAP HANA

Organizations are also finding that cost and efficiency pressures, among other factors, are leading them to use cloud-computing environments more heavily. They may run Hadoop distributions and other data stores on cloud infrastructure and, as a result, may need data integration solutions that are cloud-friendly. This can include running on the public cloud to take advantage of scalability and elasticity, on private clouds with connectivity to on-premises data sources, or in hybrid cloud environments. In a public cloud scenario, organizations may look to leverage storage, databases, and Hadoop distributions from a single infrastructure provider. In the case of Amazon Web Services, this would mean S3 for storage, Amazon Redshift for analytic data warehousing, and Amazon Elastic MapReduce for Hadoop.

As Hadoop projects evolve from small pilots to departmental use cases and, eventually, enterprise shared service environments, scalability of the data ingestion and onboarding processes becomes mission-critical. More data sources are introduced over time, individual data sources change, and the frequency of ingestion can fluctuate. As this process extends to a hundred data sources or more, which could even be a range of similar files in varying formats, maintaining the Hadoop data ingestion process can become especially painful.

At this point, organizations need to reduce the manual effort, the potential for error, and the time spent on the care and feeding of Hadoop. They need to go beyond manually designing data ingestion workflows and establish a dynamic, reusable approach that still maintains traceability and governance. One solution is to create dynamic data ingestion templates that apply metadata on the fly for each new or changed source. A recent best practices guide by Ralph Kimball advises: “consider using a metadata-driven codeless development environment to increase productivity and help insulate you from underlying technology changes.” [3] Not surprisingly, the earlier organizations can anticipate these needs, the better.
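
To make the idea of a metadata-driven ingestion template concrete, here is a minimal sketch in Python. It is an illustration only, not the approach of any particular product: the `SourceMetadata` class, the `ingest` function, the `READERS` table, and the sample source names and columns are all hypothetical, and the example reads from in-memory strings rather than from Hadoop. The point it shows is that one reusable routine handles every source, while the per-source differences (format, columns, types) live in metadata.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass(frozen=True)
class SourceMetadata:
    """Describes one source. All names here are hypothetical."""
    name: str
    fmt: str                      # "csv" or "jsonl" in this sketch
    columns: Dict[str, Callable]  # column name -> function that casts the raw value


def _read_csv(text: str) -> Iterator[dict]:
    # Each CSV row becomes a dict keyed by the header line.
    return csv.DictReader(io.StringIO(text))


def _read_jsonl(text: str) -> Iterator[dict]:
    # One JSON object per non-empty line.
    return (json.loads(line) for line in text.splitlines() if line.strip())


# Format-specific readers are looked up by the metadata's format field.
READERS = {"csv": _read_csv, "jsonl": _read_jsonl}


def ingest(meta: SourceMetadata, text: str) -> List[dict]:
    """Single reusable template: the metadata decides parsing and typing."""
    reader = READERS[meta.fmt]
    rows = []
    for record in reader(text):
        # Keep only the declared columns and cast each one as declared.
        row = {col: cast(record[col]) for col, cast in meta.columns.items()}
        row["_source"] = meta.name  # simple lineage tag for traceability
        rows.append(row)
    return rows


# Onboarding a new source means adding a metadata entry, not a new workflow.
orders = SourceMetadata("orders_csv", "csv", {"order_id": int, "amount": float})
clicks = SourceMetadata("clicks_jsonl", "jsonl", {"user": str, "page": str})

order_rows = ingest(orders, "order_id,amount\n1,9.50\n2,20\n")
click_rows = ingest(clicks, '{"user": "a", "page": "/home"}\n')
```

In this sketch, a changed source (for example, a renamed or retyped column) is handled by editing its metadata entry, and the `_source` tag stands in for the traceability that a real governance layer would provide.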

