Automated Data Pipelines for Real-Time Analytics in Big Data Ecosystems

Thrity Umrigar

doi:10.15662/IJARCST.2023.0601002

Authors

Thrity Umrigar, IPS Academy Institute of Engineering & Science, Indore, India

Keywords:

Automated Data Pipelines, Real-Time Analytics, Big Data, Stream Processing, Workflow Orchestration, Apache NiFi, Kafka, Flink, Airflow, Druid, Pre-2022 Research

Abstract

Automated data pipelines are critical for enabling real-time analytics in Big Data ecosystems. These pipelines support continuous ingestion, transformation, and delivery of streaming and batch data to drive timely insights. This review examines the state of the art up to 2022, focusing on scalable tools, architectural patterns, and automation strategies. We analyze open-source platforms such as Apache NiFi (flow-based ETL), Kafka (high-throughput messaging), Flink (stream and batch processing), and Airflow (workflow orchestration), alongside real-time analytical stores such as Apache Druid. Workflow orchestration tools including Prefect, Dagster, and MLRun show growing sophistication in handling dynamic pipelines. Frameworks such as H-STREAM exemplify microservice-based stream pipelines, while AI-enhanced pipelines aim to improve data quality and processing performance. Our methodology combines a systematic literature review with tool ecosystem mapping. Key findings highlight the centrality of event-driven microservice architectures, unified stream-batch engines, and observability in meeting real-time demands. The automated pipeline workflow typically comprises ingestion (e.g., Kafka), orchestration (e.g., Airflow), processing (e.g., Flink), and storage and analytics (e.g., Druid), with quality control integrated via AI techniques. Advantages include speed, scalability, and reduced manual intervention; disadvantages include operational complexity and steep learning curves. We conclude that real-time pipelines have matured but require further work on automation, data reliability, AI-based optimization, and governance. Future work should prioritize self-healing pipelines, unified orchestration layers, and stronger integration with ML workflows.
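
The staged workflow summarized in the abstract (ingestion, quality control, processing, storage and analytics) can be illustrated with a minimal sketch. It uses plain Python stand-ins only and does not call the Kafka, Airflow, Flink, or Druid APIs. The names `Event`, `ingest`, `validate`, `window_sums`, and `run_pipeline`, the record fields, and the 60-second tumbling window are all hypothetical choices made for illustration, and the rule-based validation is only a placeholder for the AI-based quality control the review discusses.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Event:
    key: str        # hypothetical entity id, e.g. a sensor or user
    timestamp: int  # event time in seconds
    value: float


def ingest(raw: Iterable[dict]) -> Iterator[dict]:
    """Ingestion stage: stands in for a message-queue consumer loop."""
    for record in raw:
        yield record


def validate(records: Iterable[dict]) -> Iterator[Event]:
    """Quality-control stage: drop records that are missing fields
    or carry values that cannot be parsed (rule-based placeholder)."""
    for r in records:
        try:
            yield Event(str(r["key"]), int(r["timestamp"]), float(r["value"]))
        except (KeyError, TypeError, ValueError):
            continue  # a real pipeline would route this to a dead-letter store


def window_sums(events: Iterable[Event],
                window_seconds: int) -> Dict[Tuple[str, int], float]:
    """Processing stage: sum values per key in tumbling event-time windows."""
    totals: Dict[Tuple[str, int], float] = defaultdict(float)
    for e in events:
        window_start = e.timestamp - (e.timestamp % window_seconds)
        totals[(e.key, window_start)] += e.value
    return dict(totals)


def run_pipeline(raw: Iterable[dict],
                 window_seconds: int = 60) -> Dict[Tuple[str, int], float]:
    """Chain the stages; the returned dict stands in for the analytical store."""
    return window_sums(validate(ingest(raw)), window_seconds)
```

Because each stage is a generator consuming the previous one, records flow through one at a time, which loosely mirrors how a streaming engine processes events without materializing the whole input. In a deployed system the chaining done by `run_pipeline` would be the responsibility of the orchestration layer.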
