Azure Data Factory is a no-code solution that's great for simple DAGs. However, it is not suitable for anything more enterprise-y. For instance, if you're copying large amounts of data from one database to another and there is a failure, there is no scope for cleaning up the half-written data. The first you'll know of it (because nobody ever reads the logs) is when the users complain that the numbers coming out of their SQL queries are wrong.
ADF doesn't even allow nested ForEach loops! And since it stores pipelines as JSON, watch it get confused when the pipeline itself has to handle JSON!
Don't give people raw SQL access. They can make profound changes and you'll have no logs and little ability to correct it.
Everything needs to have an automated audit log. There are going to be a large number of questions like: "why did it do this?" It's never simply a matter of success or failure. There is huge nuance - for example, type conversion may mean what was in the source system is not exactly the same as what landed in the destination system. Is this a pass or a fail?
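As a minimal sketch of what that nuance looks like in practice (the function, key names and tolerance are all hypothetical), an audit step might distinguish "identical", "converted but close" and "wrong" rather than a binary pass/fail:

```python
import logging
from datetime import datetime, timezone
from decimal import Decimal

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("audit")

def audit_row(key, source_value, dest_value, tolerance=Decimal("0.01")):
    """Record a verdict for one migrated value.

    An exact match is a clean pass; a small difference (e.g. a float
    rounded into a DECIMAL(10,2) column) is flagged as a warning rather
    than a hard failure, so a human can decide what 'correct' means.
    """
    if source_value == dest_value:
        verdict = "PASS"
    elif abs(Decimal(str(source_value)) - Decimal(str(dest_value))) <= tolerance:
        verdict = "WARN"  # converted, not identical - the nuanced case
    else:
        verdict = "FAIL"
    log.info("%s key=%s src=%r dst=%r at %s",
             verdict, key, source_value, dest_value,
             datetime.now(timezone.utc).isoformat())
    return verdict

# Example: a float that was rounded on the way into a DECIMAL(10,2) column
audit_row("order_42", 19.999, Decimal("20.00"))  # -> WARN, not PASS or FAIL
```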
Processes need orchestrating. One process reads a dataset while another deletes it or writes to it, causing nondeterministic errors. You get similar issues when you issue a cancel.
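A real orchestrator gives you this for free, but the underlying idea is just mutual exclusion around the dataset. A crude sketch, assuming a shared filesystem and a hypothetical dataset path:

```python
import os
import time
from contextlib import contextmanager

@contextmanager
def dataset_lock(path, timeout=600, poll=5):
    """Crude mutual exclusion via an exclusive lock file.

    os.O_CREAT | os.O_EXCL is atomic on most filesystems, so only one
    process can hold the lock; everyone else waits or times out.
    """
    lock_path = path + ".lock"
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not lock {path}")
            time.sleep(poll)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)

# Readers and writers both take the lock, so a delete can't race a read.
with dataset_lock("staging/customers.parquet"):
    ...  # read or rewrite the dataset here
```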
Communication: docs are always changing, so be aware that most knowledge is tribal, not recorded.
Scheduling: everything was based on time, so it was possible for two daily jobs to be running at the same time if the first took more than 24 hours. Data migrating from one DB to another within the same cloud zone and subscription moved at roughly 10 MB/s, which meant some tables took hours. And it didn't scale: as the project progressed, more tables were added to the migration, and some jobs started taking days. So the weekly transfer was still going on when people came into the office on Monday morning. Consequently, their queries were returning inconsistent results - the silent killer.
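The fix is to trigger on completion state rather than pure wall-clock time. A minimal sketch of the guard (state file name is hypothetical, and the check-then-write race is ignored for brevity - in practice you'd use the previous lock-file trick or the orchestrator's own concurrency controls):

```python
import json
import os
from datetime import datetime, timezone

STATE_FILE = "weekly_transfer.state"  # hypothetical; a shared location in practice

def try_start_run():
    """Refuse to start if the previous run is still marked RUNNING.

    This prevents two instances of the same job from overlapping when
    one run takes longer than the scheduling interval.
    """
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
        if state.get("status") == "RUNNING":
            print(f"previous run from {state['started']} still active; skipping")
            return False
    with open(STATE_FILE, "w") as f:
        json.dump({"status": "RUNNING",
                   "started": datetime.now(timezone.utc).isoformat()}, f)
    return True

def finish_run():
    with open(STATE_FILE, "w") as f:
        json.dump({"status": "DONE",
                   "finished": datetime.now(timezone.utc).isoformat()}, f)

if try_start_run():
    try:
        ...  # the actual transfer
    finally:
        finish_run()
```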
The metadata must be constrained. If it lives in a database, it must have referential integrity. This is easy to overlook because it's not business data, but if you want to eliminate mysterious errors, it's essential to get this right.
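Concretely, that means real foreign keys and check constraints on the metadata tables, not just conventions. A small sketch using SQLite (the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite disables FK enforcement by default

# Pipeline metadata with referential integrity: a run cannot
# reference a pipeline that doesn't exist.
conn.executescript("""
CREATE TABLE pipeline (
    pipeline_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE pipeline_run (
    run_id      INTEGER PRIMARY KEY,
    pipeline_id INTEGER NOT NULL REFERENCES pipeline(pipeline_id),
    status      TEXT NOT NULL CHECK (status IN ('RUNNING', 'DONE', 'FAILED'))
);
""")

conn.execute("INSERT INTO pipeline (name) VALUES ('weekly_transfer')")
conn.execute("INSERT INTO pipeline_run (pipeline_id, status) VALUES (1, 'RUNNING')")

# This fails loudly instead of silently creating an orphaned, mysterious row:
try:
    conn.execute("INSERT INTO pipeline_run (pipeline_id, status) VALUES (999, 'DONE')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)  # FOREIGN KEY constraint failed
```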
Like water finding its natural level, people will gravitate to the path of least resistance. As a general rule, teams do not follow best practices, naming conventions or industry standards. This is not strictly true, but it's so common that you must assume every team's data and code is a Rube Goldberg machine.
Hold regular meetings with major stakeholders. These meetings can be brief (about 15 minutes is OK once the cadence is established) but they do need to be frequent (at least twice a week, preferably daily).