Floods are the most common natural disaster in several countries around the world, and they have a major impact on people's lives and livelihoods. This impact can be mitigated by developing effective flood forecasting and prediction models. However, most flood prediction models are designed without taking all flood-causing factors into account, because some of these factors are heterogeneous in nature and therefore difficult to collect and handle. This paper presents a big data architecture based on a Data Lake, which can ingest and store all important heterogeneous flood-causing data sources in their raw format for use in building machine learning models. The statistical relevance of the important flood-causing factors to the flood prediction outcome is determined using inferential statistical approaches. The aim of this research is to support flood warning systems that alert the public and government officials so that they can make timely decisions in the event of a severe flood, thereby reducing socioeconomic loss.
• Flood-causing factors come from heterogeneous sources, and no big data architecture exists for handling this variety of data sources.
• Provides a data architecture solution that uses a data lake to collect and analyse heterogeneous flood-causing factors.
• Uses an inferential statistical approach to determine the importance of different flood-causing factors in the design of efficient flood prediction models.
Keywords: Data Lake; Flood Prediction; Flood causing vital factors; Inferential Statistics.
© 2023 The Author(s).