I'm looking for a tool to store documentation about the tables, data sources, ETL processes, etc. in my DWH. I've watched some presentations on YouTube, but found that most companies use a custom in-house system, or something like a wiki with plain-text descriptions. I don't think that works well for analysts, managers, and other users who need to discover what data exists and how to use it to calculate the statistics they need. Can you suggest what I could use for this case, and what I should read?
While Airflow has some baked-in support for Apache Atlas, in my opinion
one of the best data-lake metadata management tools right now is Lyft's Amundsen.
They've also released lyft/amundsendatabuilder,
the introduction of which says:
Amundsen Databuilder is a data ingestion library, which is inspired by Apache Gobblin. It could be used in an orchestration framework (e.g. Apache Airflow) to build data from Amundsen. You could use the library either with an ad-hoc Python script (example) or inside an Apache Airflow DAG (example).
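To make the extractor → loader pattern that Databuilder describes more concrete, here is a minimal plain-Python sketch of that flow. Note this is an illustration of the general pattern only: the class names (CsvTableExtractor, ListLoader, MetadataJob) and the sample CSV are invented stand-ins, not the real databuilder API, and a real loader would write to a search or graph backend rather than a list.

```python
import csv
import io

class CsvTableExtractor:
    """Yields one table-metadata record per row of a CSV source."""
    def __init__(self, csv_text):
        self._reader = csv.DictReader(io.StringIO(csv_text))

    def extract(self):
        for row in self._reader:
            yield {"schema": row["schema"],
                   "table": row["table"],
                   "description": row["description"]}

class ListLoader:
    """Collects records in memory; a real loader would publish them
    to a metadata store (e.g. a graph database or search index)."""
    def __init__(self):
        self.records = []

    def load(self, record):
        self.records.append(record)

class MetadataJob:
    """Wires an extractor to a loader, the way an orchestrator task
    (such as an Airflow DAG step) would run an ingestion job."""
    def __init__(self, extractor, loader):
        self.extractor = extractor
        self.loader = loader

    def launch(self):
        for record in self.extractor.extract():
            self.loader.load(record)

# Hypothetical sample documenting two DWH tables
sample = ("schema,table,description\n"
          "dwh,fact_sales,Daily sales facts\n"
          "dwh,dim_customer,Customer dimension\n")

loader = ListLoader()
MetadataJob(CsvTableExtractor(sample), loader).launch()
print(loader.records[0]["table"])  # -> fact_sales
```

The point of the pattern is that extractors and loaders are interchangeable: the same job shape can pull table descriptions from CSV files, database catalogs, or ETL configs, and push them into whatever backend your documentation tool uses.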