what are the steps to build a data pipeline from scratch
Thanks for reaching out.
A data pipeline is an automated process that helps you obtain, move, and transform (and sometimes visualise) your data.
In practice, this covers everything from planning how you will obtain your data (from a cloud service or elsewhere) and how you will store it in a database, through how you will analyse it or apply various services and techniques to understand it better, all the way to finally visualising it.
We use a data pipeline to answer a chain of questions that relate to the solution of a certain business problem. The technical tools (e.g. AWS, Python, SQL, big data analytics, Tableau, etc.) are the means that help us answer those questions, and typically one has relatively large freedom in choosing them.
That said, the steps normally start from an abstract perspective and end with the specific ways/means by which the analytic goals can be achieved:
a) Figure out what information you need in order to achieve a certain business/analytic goal
b) Think of what type of data you need to achieve this goal
c) Look for various data sources that can provide you with such type of data
d) Abstractly, create the chain that will take you from c) to a)
e) Work on discovering (and obtaining) the means that will realise d).
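To make the chain concrete, here is a minimal sketch of what d) and e) might look like once actualised, using only the Python standard library. The data source, field names, and the business question (total sales per region) are all hypothetical stand-ins: the "source" is an in-memory list playing the role of a CSV feed or cloud API, SQLite stands in for the database, and a simple aggregate stands in for the analysis step.

```python
import sqlite3

def extract():
    # c) obtain raw records from a data source (stubbed here; in practice
    # this would call a cloud API, read files, query an upstream system, etc.)
    return [
        {"region": "north", "sales": 120},
        {"region": "south", "sales": 80},
        {"region": "north", "sales": 50},
    ]

def transform(rows):
    # b) keep only the fields the business question actually needs
    return [(r["region"], r["sales"]) for r in rows]

def load(rows, conn):
    # store the cleaned data in a database
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def analyse(conn):
    # a) answer the (hypothetical) business question: total sales per region
    return dict(conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"))

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
totals = analyse(conn)
print(totals)  # {'north': 170, 'south': 80}
```

The point is the shape, not the tools: each lettered step becomes one function, and the pipeline is just their composition, which is what makes it easy to swap in AWS, a real warehouse, or Tableau later.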
Of course, one could start from c), if they happened to have a certain stream of data, and then try to figure out what this data can be used for.
However, what is important is to be able to constantly look at the bigger picture, clarify the entire chain (from obtaining to visualising data), and finally find the means to automate this process.
To conclude, this is only a suggested set of steps to take, and there may be a lot more freedom in defining them, particularly depending on the domain or one's experience. Feel free to add to this comment if you wish.
Hope this helps.