What is Data Ingestion?

Data ingestion is the process of obtaining, importing, and processing data for later use or storage in a database. It typically involves four steps:

  1. Discovering the data sources
  2. Importing the data
  3. Processing data to produce intermediate data
  4. Sending data out to durable data stores
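
To make these steps concrete, here is a minimal Python sketch of the four stages. The file discovery rule, column names, and SQLite store are purely illustrative assumptions, not part of any particular product:

```python
import csv
import glob
import sqlite3

# 1. Discover the data sources (here, any CSV files in the working directory).
def discover_sources():
    return glob.glob("*.csv")

# 2. Import the raw data from one source.
def import_data(path):
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

# 3. Process raw rows into intermediate records (typed and trimmed down).
#    Assumes illustrative "id" and "amount" columns in the source files.
def process(rows):
    for row in rows:
        yield {"order_id": int(row["id"]), "amount": float(row["amount"])}

# 4. Send the processed records out to a durable store (SQLite here).
def store(records, conn):
    conn.executemany(
        "INSERT INTO orders (order_id, amount) VALUES (:order_id, :amount)",
        records,
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect("ingest.db")
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")
    for source in discover_sources():
        store(process(import_data(source)), conn)
```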

Data can be streamed in real time or ingested in batches. When data is ingested in real time, each data item is imported as it is emitted by the source. When data is ingested in batches, data items are imported in discrete chunks at periodic intervals of time. An effective data ingestion process begins by prioritizing data sources, validating individual files and routing data items to the correct destination.
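
The two modes can be sketched in a few lines, assuming a hypothetical handle() function that validates and routes items to their destination; the batch thresholds are arbitrary examples:

```python
import time

def handle(items):
    """Hypothetical sink: validate and route a list of items to their destination."""
    print(f"ingesting {len(items)} item(s)")

# Real-time (streaming) ingestion: each item is imported as the source emits it.
def ingest_streaming(source):
    for item in source:
        handle([item])

# Batch ingestion: items are imported in discrete chunks -- here flushed when
# 1000 items have accumulated or 60 seconds have passed since the last flush,
# checked as items arrive.
def ingest_batched(source, max_items=1000, max_wait_s=60):
    batch, deadline = [], time.monotonic() + max_wait_s
    for item in source:
        batch.append(item)
        if len(batch) >= max_items or time.monotonic() >= deadline:
            handle(batch)
            batch, deadline = [], time.monotonic() + max_wait_s
    if batch:  # flush whatever is left at the end of the stream
        handle(batch)
```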

When numerous big data sources exist in diverse formats (the sources may often number in the hundreds and the formats in the dozens), it can be challenging for businesses to ingest data at a reasonable speed and process it efficiently in order to maintain a competitive advantage. To that end, vendors offer software programs that are tailored to specific computing environments or software applications. When data ingestion is automated, the software used to carry out the process may also include data preparation features to structure and organize data so it can be analyzed on the fly or at a later time by business intelligence (BI) and business analytics (BA) programs.

High Volume Data Ingestion Solutions
Some high-volume data ingestion solutions include:
  1. Semantic Data Compression. Today’s data compression is applied sparingly, if at all, and usually exploits only a message’s fixed structure or schema. Semantic data compression also takes advantage of knowledge about the context of the data streams, yielding “leaner” messages and better prioritization of what data is sent (see the first sketch after this list).
  2. Cloud-Based Ingestion Gateways. Secure storage/retrieval connections are managed by an ingestion gateway, and these gateways have traditionally been a Big Data bottleneck. By virtualizing the gateways and placing them in the cloud, Big Data across multiple sources becomes more manageable, overall system stability improves, and the need for server load-balancing is eliminated.
  3. Elastic Services. Instead of relying on monolithic applications that do everything, elastic services (or “microservices”) are smaller software applications that can be used independently. Individual parts of the system can therefore be rolled out, scaled up, or upgraded as needed, in real time. This prevents delays and allows a smarter allocation of tech resources.
  4. Scalable Databases. General-purpose databases (as opposed to databases designed for a single, narrow purpose) scale with need, enabling organizations to accommodate future growth while significantly reducing time, space, and financial investment.
  5. On-the-Fly Data Analysis. Traditionally, data analysis is a serial process in which data is (1) extracted from relevant sources, (2) normalized and cleaned, (3) transferred to a central database, and (4) loaded for use (a pattern known as Extract/Transform/Load, or ETL). Since these steps do not overlap, there is a considerable lag between extraction and use. More modern methods perform data analysis “on the fly”: data is processed as soon as it is available, while other data is still being gathered and prepared. This allows information to be used in near real time (see the second sketch after this list).
  6. Data Normalization. Data normalization combines variants of the same bit of data, reducing data redundancy (which in turn speeds up transfer and processing) and producing more accurate analytics (see the final sketch after this list).
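
To illustrate the Semantic Data Compression idea from item 1, the sketch below contrasts a schema-only encoding (which drops repeated field names) with a semantic encoding that also uses context about the data. The sensor readings, field names, and the delta rule (the assumption that temperatures drift slowly, so only differences from the previous value need to be sent) are illustrative assumptions:

```python
import json
import zlib

readings = [{"sensor": "t1", "temp_c": 21.4},
            {"sensor": "t1", "temp_c": 21.5},
            {"sensor": "t1", "temp_c": 21.5}]

# Schema-only approach: drop the repeated field names, keep every full value.
schema_encoded = [(r["sensor"], r["temp_c"]) for r in readings]

# Semantic approach: exploit the context that temperatures drift slowly,
# so after the first reading only the delta from the previous value is kept.
def semantic_encode(rows):
    prev = None
    for r in rows:
        yield (r["sensor"], r["temp_c"] if prev is None else round(r["temp_c"] - prev, 2))
        prev = r["temp_c"]

semantic_encoded = list(semantic_encode(readings))

# Print the compressed payload size of each representation for comparison.
for name, payload in [("raw json", readings),
                      ("schema-only", schema_encoded),
                      ("semantic", semantic_encoded)]:
    print(name, len(zlib.compress(json.dumps(payload).encode())))
```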
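For the On-the-Fly Data Analysis idea in item 5, here is a minimal sketch, assuming a hypothetical stream of numeric events: instead of waiting for a full extract/transform/load cycle to finish, a running aggregate is updated as each event arrives, so a result is available in near real time:

```python
# On-the-fly analysis: update a running aggregate as each event arrives,
# instead of waiting for a full extract/transform/load cycle to complete.
def running_average(events):
    count, total = 0, 0.0
    for value in events:
        count += 1
        total += value
        yield total / count  # an up-to-date result after every event

# Usage: the average is available after each event, not only at the end.
for avg in running_average([10.0, 12.0, 8.0, 11.0]):
    print(f"current average: {avg:.2f}")
```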
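And for Data Normalization in item 6, a small sketch that combines variants of the same bit of data using a hypothetical lookup table of known spellings; real systems use richer matching rules, but the effect on redundancy is the same:

```python
# Hypothetical variants of the same value, as they might arrive from different sources.
CANONICAL = {
    "u.s.a.": "United States",
    "usa": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "u.k.": "United Kingdom",
}

def normalize_country(raw):
    key = raw.strip().lower()
    return CANONICAL.get(key, raw.strip())

rows = ["USA", "U.S.A.", "United States of America", "U.K."]
print(sorted({normalize_country(r) for r in rows}))
# ['United Kingdom', 'United States'] -- four variants reduced to two values
```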
