In data pipelines, "upstream" refers to the stages or processes that produce data for later stages to consume; "downstream" refers to the stages or processes that consume data produced by earlier stages.
- Upstream: Processes that generate or transform data before it is used by downstream processes. For instance, an ETL step that extracts and transforms data is upstream of the step that loads it into the data warehouse.
- Downstream: Processes that depend on the output of upstream processes. For example, data visualization tools that generate reports from the data loaded into the warehouse are downstream processes.
Example code, assuming a pipeline orchestrated with Apache Airflow:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime
def extract():
    # Upstream task: extract data from the source system
    pass

def transform():
    # Transform data: downstream of extract, upstream of load
    pass

def load():
    # Downstream task: load the transformed data into the warehouse
    pass
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
}
dag = DAG('data_pipeline', default_args=default_args, schedule_interval='@daily')
start = DummyOperator(task_id='start', dag=dag)
extract_task = PythonOperator(task_id='extract', python_callable=extract, dag=dag)
transform_task = PythonOperator(task_id='transform', python_callable=transform, dag=dag)
load_task = PythonOperator(task_id='load', python_callable=load, dag=dag)
end = DummyOperator(task_id='end', dag=dag)
# The >> operator declares dependencies: each task is upstream of the task to its right
start >> extract_task >> transform_task >> load_task >> end
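In this DAG, extract_task is upstream of transform_task and load_task, and load_task is downstream of everything before it. With Airflow 2.x, the same upstream/downstream relationship can also be expressed with the TaskFlow API, where passing one task's return value into another both declares the dependency and moves the data (via XCom) from the upstream task to the downstream one. The sketch below is illustrative only: the DAG id, the sample records, and the doubling transform are assumptions, not part of the pipeline above.

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval='@daily', start_date=datetime(2023, 1, 1), catchup=False)
def data_pipeline_taskflow():

    @task
    def extract():
        # Upstream task: produce raw records (illustrative sample data)
        return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]

    @task
    def transform(records):
        # Consumes the upstream output and produces transformed rows
        return [{**r, "value": r["value"] * 2} for r in records]

    @task
    def load(rows):
        # Downstream task: consume the transformed rows
        print(f"loading {len(rows)} rows")

    # Passing return values wires the dependencies and ships the data downstream via XCom
    load(transform(extract()))

data_pipeline_taskflow()

Either way, Airflow enforces the declared dependencies at run time: with the default trigger rule (all_success), a downstream task runs only after all of its upstream tasks have succeeded.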