All posts by genem

World-Class Data Warehouse Starting Under $100/mo

Start with an S3 data lake, add Snowflake, and then NiFi as needed for automated data and file movement flows. (Microphone drop.) Dare I say most may not even need HDFS? Just a cloud-based object store and linearly scalable performance for standard data access paths like ODBC/JDBC, SQL, AMQP, DAG-based ELT jobs, and file extracts. That’s the power of the new class of distributed storage/compute DB platforms like Snowflake (and maybe Redshift/Spectrum).
If you are a true data platform play (health insurance, online media, etc.), then yes, you might want to start spinning up a few HDFS clusters, if only for the Spark hotness. But 90%+ of organizations will get 95%+ of what they need in an enterprise data warehouse from AWS + Snowflake. And it’s scalable starting at <$100/mo. Sign me up. Oh wait, I already did.
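To make that concrete, here is a minimal sketch of the whole path using the snowflake-connector-python package. Every name in it (account, bucket, tables, credentials) is a hypothetical placeholder, but the shape is the point: an X-Small warehouse that auto-suspends (the knob that keeps the bill small, since compute is billed only while it runs), an external stage over the S3 data lake, and a COPY INTO to load.

    # Sketch: S3 data lake feeding a Snowflake warehouse.
    # All account, bucket, table, and credential values are placeholders.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",
        user="loader",
        password="...",
        database="ANALYTICS",
        schema="RAW",
    )
    cur = conn.cursor()

    # Smallest warehouse; suspends after 60 idle seconds, so compute
    # charges accrue only while queries actually run.
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS load_wh
            WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE
    """)
    cur.execute("USE WAREHOUSE load_wh")

    # External stage pointing at the S3 data lake; the bucket stays the
    # system of record, Snowflake just reads from it.
    cur.execute("""
        CREATE STAGE IF NOT EXISTS lake_stage
            URL = 's3://my-data-lake/events/'
            CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
    """)

    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            event_ts TIMESTAMP, user_id STRING, event_type STRING)
    """)

    # Bulk load straight from the lake into the warehouse table.
    cur.execute("""
        COPY INTO events
        FROM @lake_stage
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)

From there, ODBC/JDBC clients, SQL tools, and DAG-based ELT jobs all hit the same tables over standard connections.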

I Hate Open Data Portals

Civic Innovations

Well, not really – but I do dislike certain things about most open data portals. Even the ones that I work with every day or that I have been involved with in the past.

Don’t get me wrong – I’m a true believer in the power of open data. I love that every day there are more and more governments posting open data to specialized sites meant to make their data available to external (and, increasingly, internal) users. But there are things about the way that most open data portals are structured and used that bother me – I think we can do better. And I think a lot of people will agree with me.


Agile Data Warehouse Modeling: How to Build a Virtual Type 2 Slowly Changing Dimension

Virtualize first, instantiate where needed…

The Data Warrior

One of the ongoing complaints about many data warehouse projects is that they take too long to deliver. This is one of the main reasons that many of us have tried to adopt methods and techniques (like Scrum) from the agile software world to improve our ability to deliver data warehouse components more quickly.

So, what activity takes the bulk of development time in a data warehouse project?

Writing (and testing) the ETL code to move and transform the data can take up to 80% of the project resources and time.

So if we can eliminate, or at least curtail, some of the ETL work, we can deliver useful data to the end user faster.

One way to do that would be to virtualize the data marts.

For several years Dan Linstedt and I have discussed the idea of building virtual data marts on top of a Data Vault modeled…
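To illustrate the technique named in the post’s title (the sketch below is mine, not the author’s code, and every table and column name in it, such as hub_customer, sat_customer, and load_dts, is hypothetical): a virtual Type 2 dimension can be nothing more than a view over a Data Vault hub and its satellite, with each version’s effective window derived by a window function instead of ETL code.

    # Sketch of a "virtual" Type 2 SCD: a view over a Data Vault hub +
    # satellite instead of an ETL-maintained dimension table.
    # All object names are hypothetical.
    import snowflake.connector

    VIRTUAL_DIM = """
    CREATE OR REPLACE VIEW dim_customer_v AS
    SELECT
        h.customer_key,
        s.name,
        s.segment,
        s.load_dts AS effective_from,
        -- The next satellite row for the same key closes this version;
        -- the current version stays open-ended.
        COALESCE(
            LEAD(s.load_dts) OVER (
                PARTITION BY s.customer_key ORDER BY s.load_dts),
            '9999-12-31'::TIMESTAMP
        ) AS effective_to
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_key = h.customer_key
    """

    conn = snowflake.connector.connect(
        account="my_account", user="modeler", password="...",
        database="ANALYTICS", schema="DV")
    conn.cursor().execute(VIRTUAL_DIM)

Fact rows then join on the key with the fact’s timestamp BETWEEN effective_from AND effective_to, and if the view ever becomes a bottleneck, the same SELECT can be instantiated as a physical table: virtualize first, instantiate where needed.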


Of War and Survivorship Bias

A great read about the failure to look for what is missing in your data set.

“When a company performs a survey about job satisfaction the only people who can fill out that survey are people who still work at the company. Everyone who might have quit out of dissatisfaction is no longer around to explain why.”
     
“I have to chuckle whenever I read yet another description of American frontier log cabins as having been well crafted or sturdily or beautifully built. The much more likely truth is that 99% of frontier log cabins were horribly built—it’s just that all of those fell down.”

You Are Not So Smart

The Misconception: You should focus on the successful if you wish to become successful.

The Truth: When failure becomes invisible, the difference between failure and success may also become invisible.

Illustration by Brad Clark at http://www.plus3video.com

In New York City, in an apartment a few streets away from the center of Harlem, above trees reaching out over sidewalks and dogs pulling at leashes and conversations cut short to avoid parking tickets, a group of professional thinkers once gathered and completed equations that would both snuff and spare several hundred thousand human lives.

People walking by the apartment at the time had no idea that four stories above them some of the most important work in applied mathematics was tilting the scales of a global conflict as secret agents of the United States armed forces, arithmetical soldiers, engaged in statistical combat. Nor could people today know as they open umbrellas and…


Amazon Redshift Performance & Cost by Airbnb

I just ran across this post by Airbnb from last year about their testing of Amazon Redshift vs. Hive. It includes some good data points on cluster configuration (including dollars) and performance, but more useful is the direct comparison between Hive and Redshift.

Our data pipeline thus far has consisted of Hadoop, MySQL, R and Stata. We’ve used a wide variety of libraries for interfacing with our Hadoop cluster such as Hive, Pig, Cascading and Cascalog. However, we found that analysts aren’t as productive as they can be by using Hadoop, and standalone MySQL was no longer an option given the size of our dataset. We experimented with frameworks such as Spark but found them to be too immature for our use-case. So we turned our eye to Amazon Redshift earlier this year, and the results have been promising. We saw a 5x performance improvement over Hive.