Start off 2009 with a more philosophical entry…
I was recently in Asia to give the keynote talk at the International Conference on Asia-Pacific Digital Libraries (in Bali, Indonesia!). In my recent travels and talks, I have been asked about the relationship between the latest buzz on “Cloud Computing” and Web2.0 (with its already-evident connections to service-based computing, the social web, and social science).
The cloud computing trend is perhaps best understood as data management and computational processing moving away from personal computing frameworks and into collaborative workspaces that are managed in the network. The impact is wide and deep. It’s intertwined with service-based computing, Web2.0, and other trends.
The main value proposition is further “abstraction” that reduces management costs. For example, backup storage is abstracted into the cloud, so you don’t have to worry about your hard disk failing. Computation is abstracted into the cloud, so you don’t have to worry about not having enough computational nodes for your data analysis job. It is an inevitable trend in computing, because of the need to reduce complexity and the costs of managing data and computation. It’s clear that, in the near future, backup storage and computation will continue to evolve into collaborative workspaces that you never have to administer and where you never have to worry about backing up your work.
Cloud computing has been touted as the second coming of computing science, with the claim that all scientific endeavors will now rely on cloud computation capabilities. Jim Gray (the missing sailor, Turing Award winner, and database guru) once said that the fourth paradigm of scientific discovery will involve “data-intensive explorations which unify theory, simulation, and experiment”. I was asked what I thought of this new direction. Jim Gray is (was) a big figure in computing, so his opinion is certainly worth its weight in gold. It’s certainly one approach that would enable us to tackle bigger and more complex problems.
Jim Gray’s fourth paradigm is rooted in his belief that data is at the heart of science, essentially a kind of fundamental ‘empiricism’. This kind of empiricism has certainly been at the heart of social experiments in Web2.0 applications. This viewpoint was argued by Shneiderman recently in the journal Science as being a kind of ‘Science2.0’. The label ‘2.0’ certainly has some relation to Web2.0 and cloud computing, in that the same computational techniques being invented to handle social analytics and cloud computing are needed to do this new kind of empirical science.
The big bet is that big data sets will enable bigger science to be done (if you believe that all science derives fundamentally from observations). I do worry that this viewpoint places too much faith in blackbox science (i.e., input a large data set into a database, apply MapReduce or other parallelized machine-learning techniques, and then wham! Patterns emerge!). It expects machine learning to do much of the heavy lifting. True scientific model building isn’t just finding some parameters for some statistical algorithm. Science has more creativity than that.
From a practical perspective, the need for models and patterns for design is pressing, and we certainly can’t just rely on rationalism to generate all of the understanding needed to push forward. So Jim Gray’s paradigm and other versions of Science2.0 are certainly part of the answer to really advancing scientific understanding. Big-data-science has certainly been a huge propeller of advanced web analytics, enabling Google/Yahoo/Microsoft to be the big winners in computing. So investing in big-data-science is a ‘no-brainer’ in my book, but one needs to combine it with truly creative scientific work.