Automated Data Pipelines for Real-Time Analytics in Big Data Ecosystems
DOI:
https://doi.org/10.15662/IJARCST.2023.0601002
Keywords:
Automated Data Pipelines, Real-Time Analytics, Big Data, Stream Processing, Workflow Orchestration, Apache NiFi, Kafka, Flink, Airflow, Druid, Pre-2022 Research
Abstract
Automated data pipelines are critical for enabling real-time analytics in Big Data ecosystems. These pipelines support continuous ingestion, transformation, and delivery of streaming and batch data to drive timely insights. This review examines the state of the art up to 2022, focusing on scalable tools, architectural patterns, and automation strategies. We analyze open-source platforms such as Apache NiFi (flow-based ETL), Kafka (high-throughput messaging), Flink (stream and batch processing), and Airflow (workflow orchestration), alongside real-time analytical stores such as Apache Druid. Workflow orchestration tools including Prefect, Dagster, and MLRun demonstrate growing sophistication in handling dynamic pipelines. Studies such as H-STREAM exemplify microservice-based stream pipelines, while AI-enhanced pipelines optimize data quality and processing performance. Our methodology combines a systematic literature review with tool ecosystem mapping. Key findings highlight the centrality of event-driven microservice architectures, unified stream-batch engines, and observability in meeting real-time demands. The automated pipeline workflow typically comprises ingestion (e.g., Kafka), orchestration (e.g., Airflow), processing (e.g., Flink), and storage and analytics (e.g., Druid), with quality control integrated via AI techniques. Advantages include speed, scalability, and reduced manual intervention; disadvantages include operational complexity and steep learning curves. We conclude that real-time pipelines have matured but require further work on automation, data reliability, AI-based optimization, and governance. Future work should prioritize self-healing pipelines, unified orchestration layers, and stronger integration with ML workflows.
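The stage ordering described in the abstract (ingestion, quality control, processing, storage) can be illustrated with a minimal, standard-library-only Python sketch. It is not taken from the reviewed tools: every name here (Event, ingest, quality_gate, tumbling_window_sum, run_pipeline) is hypothetical, the in-memory list stands in for a broker topic, the rule-based check is only a placeholder for an AI-based quality step, and the returned dictionary stands in for an analytical store.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Event:
    key: str        # hypothetical entity id, e.g. a sensor or user
    timestamp: int  # event time in seconds
    value: float


def ingest(raw: Iterable[dict]) -> Iterator[Event]:
    """Ingestion stage: in-memory stand-in for consuming a broker topic."""
    for record in raw:
        yield Event(record["key"], record["timestamp"], record["value"])


def quality_gate(events: Iterable[Event], rejected: List[Event]) -> Iterator[Event]:
    """Quality-control stage: a simple rule (placeholder for an AI-based check)."""
    for event in events:
        is_nan = event.value != event.value
        if is_nan or event.value < 0:
            rejected.append(event)  # keep rejects for later inspection
        else:
            yield event


def tumbling_window_sum(
    events: Iterable[Event], window_seconds: int
) -> Dict[Tuple[str, int], float]:
    """Processing stage: sum values per key and per event-time tumbling window."""
    totals: Dict[Tuple[str, int], float] = defaultdict(float)
    for event in events:
        window_start = event.timestamp - (event.timestamp % window_seconds)
        totals[(event.key, window_start)] += event.value
    return dict(totals)


def run_pipeline(raw: Iterable[dict], window_seconds: int = 60):
    """Chain the stages; the returned dict stands in for an analytical store."""
    rejected: List[Event] = []
    store = tumbling_window_sum(quality_gate(ingest(raw), rejected), window_seconds)
    return store, rejected
```

Calling run_pipeline on a short list of records returns the per-window totals together with the rejected events. In a real deployment each function would be replaced by the corresponding system, with an orchestrator scheduling and monitoring the stages.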
References
1. Apache NiFi overview (flow-based ETL automation). Wikipedia.
2. Apache Kafka for real-time, high-throughput messaging. Data Stack Hub; ByteHouse.
3. Apache Flink: unified stream/batch processing, event time, low latency. Wikipedia; Data Stack Hub.
4. Apache Airflow: workflow orchestration with Python-based DAGs. Wikipedia; Estuary.
5. Apache Druid: real-time OLAP engine for analytics dashboards. Wikipedia.
6. Tool ecosystem overviews including Prefect, Dagster, MLRun. Medium; Airbyte.
7. H-STREAM microservice stream pipeline framework. arXiv.
8. AI-driven data quality and stream optimization results


