Forget Big Data, Think Mid Data

Tom H.C. Anderson

Stop chasing Big Data; Mid Data makes more sense

After attending the American Marketing Association’s first conference on Big Data this week, I’m even more convinced of what I already suspected from speaking with hundreds of Fortune 1000 marketers over the last couple of years. Extremely few are working with anything approaching what would be called “Big Data” – and I believe they don’t need to. Many should, however, start thinking about how to work with Mid Data!


“Big Data”, “Big Data”, “Big Data”. It seems like everyone is talking about it, but I find extremely few researchers are actually doing it. Should they be?

If you’re reading this, chances are that you’re a social scientist or business analyst working in consumer insights or a related area. I think it’s high time we narrowed the definition of ‘Big Data’ a bit and introduced a new, more meaningful and realistic term, “MID DATA”, to describe what is really the beginning of Big Data.

If we introduce this new term, it only makes sense that we refer to everything that isn’t Big or Mid data as Small Data (I hope no one gets offended).

Small Data

I’ve included a chart, and for simplicity I will think of size here as the number of records, or sample size if you prefer.

‘Small Data’ can include anything from a single interview in qualitative research to several thousand survey responses in a longitudinal study. At this size, quantitative and qualitative data can technically be lumped together, as neither currently fits the generally agreed upon (and admittedly loose) definition of “Big Data”. You see, rather than referring to a specific size, the current definition of Big Data varies depending on the capabilities of the organization in question. The general rule is that Big Data is data which cannot be analyzed by commonly used software tools.

As you can imagine, this definition is an IT/hardware vendor’s dream, as it describes a situation where a firm does not have the resources to analyze (supposedly valuable) data without spending more on infrastructure, usually a lot more.

Mid Data

What then is Mid Data? At the beginning of Big Data, some of the same data sets we might call Small Data can quickly turn into Big Data. Take, for instance, the 30,000-50,000 records from a customer satisfaction survey, which can usually be analyzed in commonly available analytical software like IBM SPSS without crashing. Add text comments to this same data set, however, and performance slows considerably; these data sets will now often take too long to process or, more typically, crash the software.

If these text comments are also coded, as is the case in text mining, the additional variables may increase the size of the dataset significantly. This is currently viewed as Big Data, where more powerful software will be needed. However, I believe a more accurate description would be Mid Data, as it is really just the beginning of Big Data, and there are many relatively affordable approaches to dealing with data of this size. But more about this in a bit…
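To make that column explosion concrete, here is a minimal, hypothetical sketch (the column names and comments are invented for illustration) of how coding open-ended comments into one variable per term – a simple bag-of-words – widens a survey dataset:

```python
from collections import Counter

# A survey dataset with a handful of structured columns stays narrow,
# but coding open-ended comments into one count column per distinct
# term (a basic bag-of-words) multiplies the column count.
structured_columns = ["respondent_id", "satisfaction", "nps", "region"]

comments = [
    "delivery was slow but support was friendly",
    "great product terrible delivery",
    "support resolved my issue quickly",
]

# Every distinct term in the comments becomes a new variable.
vocabulary = sorted({term for c in comments for term in c.split()})

# Each respondent's comment becomes a row of term counts.
coded_rows = [
    [Counter(c.split())[term] for term in vocabulary] for c in comments
]

print(len(structured_columns))                     # 4 columns before coding
print(len(structured_columns) + len(vocabulary))   # 17 columns after coding
```

With only three short comments the dataset already jumps from 4 to 17 columns; with 50,000 free-text responses the vocabulary, and therefore the width of the coded dataset, grows into the thousands, which is exactly where desktop analytical tools begin to struggle.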

Big Data

Now that we’ve taken a chunk out of Big Data and called it Mid Data, let’s redefine Big Data, or at least agree on where Mid Data ends and when ‘Really Big Data’ begins.

To understand the differences between Mid Data and Big Data we need to consider a few dimensions. Gartner analyst Doug Laney famously described Big Data as three-dimensional, having increasing volume, variety, and velocity (now commonly referred to as the 3V model).

To understand the difference between Mid Data and Big Data though, only two variables need to be considered, namely Cost and Value. Cost (whether in time or dollars) and expected value are of course what make up ROI. This could also be referred to as the practicality of Big Data Analytics.

While we know that some data is inherently more valuable than other data (100 customer complaints emailed to your office should be more relevant than 1,000 random tweets about your category), one thing is certain: data that is not analyzed has absolutely no value.

Really Big Data, to the far right of Mid Data, marks the point beyond which an investment in analysis no longer makes sense: the cost (which includes the risk of not finding insights worth more than the dollars invested) outweighs the expected value. Somewhere past Mid Data, big data analytics becomes impractical both theoretically and, for your firm, in very real economic terms.

Mid Data, on the other hand, can be viewed as the sweet spot of Big Data analysis: that which is currently possible, worthwhile, and within budget.

So What?

Mid Data is where many of us in market research have a great opportunity. It is where very real and attainable insight gains await.

Really Big Data, on the other hand, may be well past a point of diminishing returns.

On a recent business trip to Germany I had the pleasure of meeting a scientist working on a real Big Data project: the famous Large Hadron Collider at CERN. Unlike CERN, consumer goods firms will not fund the software and hardware needed to analyze that level of Big Data. Data magnitudes common at the Collider (150 million sensors delivering data 40 million times per second) are neither economically feasible for them nor needed. In fact, even the scientists at CERN do not analyze all of this Big Data. Instead, they filter out 99.999% of collisions, focusing on just 100 “collisions of interest” per second.
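The CERN approach – discard almost everything before analysis ever starts – can be sketched in a few lines. This is a hypothetical illustration, not CERN’s actual trigger logic or selection ratio; the event structure, the “energy” score, and the 0.1% threshold are all invented for the example:

```python
import random

random.seed(7)  # deterministic for the illustration

def event_stream(n):
    # Hypothetical stand-in for a high-volume event source: each event
    # carries an "energy" score, and almost all events are routine.
    for i in range(n):
        yield {"id": i, "energy": random.random()}

def trigger(event, threshold=0.999):
    # The "trigger" keeps only events of interest, discarding the rest
    # before any expensive downstream analysis happens.
    return event["energy"] >= threshold

# Stream 100,000 events; only roughly the top 0.1% survive the filter.
kept = [e for e in event_stream(100_000) if trigger(e)]
print(len(kept))
```

The point is that the expensive work (storage, modeling, human attention) is only ever applied to the small surviving fraction, which is exactly the economics that make Mid Data analysis affordable.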

The good news for those of us in business is that, if we’re honest, customers really aren’t that difficult to understand. There are now many affordable and excellent Mid Data software packages available, for both data and text mining, that require neither exabytes of data nor massively parallel software running on thousands of servers. While magazines and conference presenters like to cite Amazon, Google and Facebook, even these somewhat rare examples sound more like IT sales science fiction, and such references never mention the sampling of data that occurs even at these companies.

As the scientists at CERN have already discovered, it is more important to properly analyze the fraction of the data that matters (“of interest”) than to process all of it.

At this point some of you may be wondering: if Mid Data is more attractive than Big Data, then isn’t Small Data even better?

The difference, of course, is that as data increases in size we can not only be more confident in the results, but we can also find relationships and patterns that would never have surfaced in traditional Small Data. In marketing research this may mean discovering a new niche product opportunity or quickly countering a competitor’s move. In pharma, it may mean discovering a link between a small population subgroup and a certain high cancer risk, thus saving lives!
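A minimal sketch (with entirely hypothetical numbers) of why larger samples surface patterns that small ones cannot: if a subgroup makes up 1 in 100 respondents, and a pattern needs some minimum number of subgroup members before it becomes detectable, a small sample simply never shows it.

```python
SUBGROUP_ONE_IN = 100    # assume the subgroup is 1% of respondents
MIN_CASES_NEEDED = 30    # assumed rough threshold for a detectable pattern

def expected_subgroup_cases(sample_size):
    # Expected number of subgroup members in a simple random sample.
    return sample_size // SUBGROUP_ONE_IN

for n in (500, 5_000, 50_000):
    cases = expected_subgroup_cases(n)
    print(n, cases, cases >= MIN_CASES_NEEDED)
# A 500-person study yields ~5 subgroup members (pattern invisible),
# while 50,000 records yield ~500 (pattern detectable).
```

The exact threshold is debatable, but the scaling is not: the rarer the subgroup, the larger the sample needed before its pattern can emerge at all.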

Mid Data could benefit from further definition and best practices. Ironically, some C-suite executives are currently asking their IT people to “connect and analyze all our data” (specifically the “varied” data in the 3V model), and in the process they are attempting to create Really Big (often bigger than necessary) data sets out of several Mid Data sets. This practice exemplifies the ROI problem I mentioned earlier: chasing after a Big Data holy grail will not guarantee any significant advantage. Those of us skilled in the analysis of Small or Mid Data understand that running the same analysis across such varied data is typically fruitless.

It makes as much sense to compare accounting data to consumer respondent data as it does to compare apples to cows. Likewise, comparing your customers in Japan to your customers in the US makes no sense, for reasons ranging from cultural differences to differences in very real tactical and operational options.

No, for most of us, Mid Data is where we need to be.


[Full Disclosure: Tom H. C. Anderson is Managing Partner of Anderson Analytics, which develops and sells the patent-pending data mining and text analytics software platform OdinText]


8 Comments

  1. Post author

    One of the in vogue and most stupid ways IMHO to make your data bigger = “we need to overlay social media (tweets) onto our data”

    1. Will Martin

      Not if the tweet is an annotation to be used in result formulation. Your point is on the mark if the tweet is considered part of the data; however it may be … well okay I’m not sure I’d ever do it … just a storage decision.

  2. Pingback: Jeffrey Henning's #MRX Top 10: Big Data vs. Mid Data, Data vs. Metrics, and GRIT Data vs. The Future | GreenBook

  3. Steven Finlay

    This is an interesting piece. It’s about time we moved away from the “Hype” part of the hype cycle when it comes to Big Data.
    People often seem to get confused between the full data set (millions or tens of millions of cases) and the samples required to build good models (a few thousand cases). My own research into sample size “Instance sampling in credit scoring: An empirical study of sample size and balancing” published last year in the International Journal of Forecasting, suggested that the performance of predictive models constructed using standard techniques begins to plateau once you get beyond about 5-6,000 cases of each class (for classification problems). In my experience, the gain you get from moving from 10,000 cases to a million is only of the order of 1-2% in terms of GINI/R2 etc. and perhaps not even that much. That may be worth having, but it’s very much a case of diminishing returns.
    Poor validation is one reason why some people report bigger performance improvements than this. Very few people use best-practice validation when building predictive models (K-fold, bootstrap etc.), and with single small validation sets, performance metrics can be erratic. Alternatively, people overfit their models on small-to-medium sized data sets and then compare the performance against models built on much larger data sets where overfitting has not occurred. So perhaps one lesson here is that good model builders can get more out of a small-to-medium data set than less experienced people adopting a less intelligent “brute force” approach.

  4. Ed Robin

    Our last (9 month) contract with NYSERDA (New York State Energy Research Authority) proved that the collection of “Mid Data” by some 300+ participants from the New York Mid Hudson Valley (municipalities, non-profits, for-profit companies, universities, etc.) resulted in a conglomeration of data collected from numerous sources and formats (stored temporarily in BaseCamp library files). This data must be organized, architected and made available for budgeting & planning, and for selecting Sustainability Projects (from some 218 projects proposed during the contract period) via a proposed Score Card/Indicator/Target process using Business Intelligence Modeling, while making Dashboards available for participants’ decision making and future Project Management. We are proposing a Sustainability Analytics Center to carry out these tasks and services. Only by taking a “Big Data” approach do we believe this problem/solution can be addressed. Your thoughts?
    Ed Robin

  5. Pingback: Think Small Data, and Triangulate: Tom H.C. Anderson on Next Generation Research Methods | Breakthrough Analysis

  6. Pingback: Think Mid Data, and Triangulate: Tom H.C. Anderson on Next Generation Research Methods | Breakthrough Analysis

  7. Pingback: The Myth of Small Data, the Sense of Smart Data, and the End of All Data | Breakthrough Analysis
