Asilomar in Lowell - May 5 2003

Raw notes by Mike Lesk

 

Gong show Sunday Night Hans Schek Moderator

Hans Schek - new infrastructure for information space - this is thousands or millions of databases and computational services that do multimedia and text classification, term extraction - an architecture with lots of services; I call the new architecture a "hyperdatabase" - databases can sit on all our devices.  Analogy - a database is a platform for developing applications on shared data; a hyperdatabase is a platform for developing services - instead of indexing we need feature extraction.  Research area not just for us.

 

Phil Bernstein – Drivers of research topics: exploit information systems technology, new ideas for classical DB problems, grand challenges, combine DB & non-DB technology, applications. Build knowledge-based systems that work: more logical inferencing, richer knowledge representation, engineer the combination of AI & DB.  Build experiment management systems: data management for experimental investigation, for science and engineering.  Requires rapid schema evolution; versioning of data, schemas and types; heavy use of rich data types and data mining; integration with portal management.

 

Mike Franklin - working on a suite of projects about query processing in strange and interesting environments.  Telegraph with Joe - stream processing: how to do this with lots of sharing, adaptive processing; and sensor networks - how to push this out into the network - lots more interesting work to be done.  Errors, lost messages, the nature of sensing the environment.  XML broker - how to process large numbers of Xpath and Xquery queries - 10s of thousands or 100s of thousands.  Also applying query processing techniques to the Grid.  Make it more interactive and more easily programmable.

 

Bruce Croft -  talk from recent ICDE conference - developing new probabilistic model of retrieval - applied to cross-language retrieval, image retrieval, and tighter integration with speech recognition, and MT.  Just started working on pushing that to do retrieval in semi-structured IR domain.  Can we provide IR API for semistructured database?

 

Joe Hellerstein - bringing data independence to networking - network is not just moving packets like post office- trying to write intelligent programs on top of a very volatile system - programs must be robust against nodes coming and going - we're doing sensor networks and peer to peer processing - convergence between graph algorithms and query optimization - we need adaptive algorithms - I enjoy both algorithms and building systems - having more fun than in several years.

 

Jeff Ullman - group should address TIA problem - linked discovery or chains of discovery in multiple databases - the technical problem excites me. Query optimization is not the same for streams as for traditional databases - the new stuff with XML is not just the same as SQL optimization.

 

Rick Snodgrass - methodological basis for our field - now hampered by our methodology - in science knowledge is encoded in theories - scientific theories are testable and make predictions - our basis is twofold (a) how about it, (b) if you need better performance we'll put something on.  We test on a few data points.  We don't have scientific models.  We need a list of needed scientific models.  I have 4 suggestions which I sent out in email - can't do this in a few seconds.  Each model is predictive and testable.

 

Avi Silberschatz - many years ago I had a dream - my laptop as a database machine with universal access to all data in the same way - some people in Stony Brook had the same goal - the database would sit below the operating system.  Don’t want to have to remember things like "lpq".  All the data in the world will sit in some form of database with universal access to it.  Lots of research issues - it would be great to accomplish this.

 

Mike Carey - stopped doing research a few years ago - I’m in industry waiting for problems to come to me - working on XML - adding workflow to Xquery so it can do data transformations - now using XML schemas to think about your data and Xquery to express integration - want to integrate services and data.

 

Alon Halevy - my goal is to get people to stop complaining about semantic heterogeneity - want to automatically match between objects in different databases - experts use names and values - but over time they see lots of schemas and they get good at this.  We’re using a big corpus to learn things - e.g. typical attributes for a field named "student" - and using patterns of this sort to match between different schemas and reformulate queries for a database we don't know anything about.  This is part of our idea of how to do a Google of 10K databases; we reformulate your query.  More generally, cross the "structure chasm" between the IR and DB worlds - make it easier for people to author and query data.
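A minimal sketch of the corpus-based matching Halevy describes - learn which attributes co-occur with a field name across previously seen schemas, then match an unseen field by attribute overlap.  The schemas, field names, and Jaccard scoring below are invented for illustration:

```python
from collections import defaultdict

# Toy "corpus" of previously seen schemas: field name -> its attributes.
corpus = [
    {"student": ["name", "gpa", "advisor"]},
    {"student": ["name", "gpa", "major"]},
    {"employee": ["name", "salary", "dept"]},
]

def learn(corpus):
    """Collect the attribute sets seen with each field name."""
    seen = defaultdict(list)
    for schema in corpus:
        for field, attrs in schema.items():
            seen[field].append(set(attrs))
    return seen

def match(field_attrs, learned):
    """Guess which known field an unseen field corresponds to,
    by average Jaccard overlap with past attribute sets."""
    attrs = set(field_attrs)
    def score(examples):
        return sum(len(attrs & ex) / len(attrs | ex) for ex in examples) / len(examples)
    return max(learned, key=lambda f: score(learned[f]))

match(["name", "gpa", "year"], learn(corpus))  # 'student'
```

A real system would also use data values and learned classifiers; this only shows the shape of the idea.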

 

Rakesh Agrawal - how can we make our data systems more privacy aware - can we design info systems which will be sensitive to privacy and data ownership but not impede flow of information.  Two primary drivers - technology is too invasive in wrong hands - build in antidotes; and new business models that require cooperation plus national security requiring need-to-know for sharing; and the underlying technologies for these are likely to be similar.

 

Mike Pazzani - I'm here to listen - want to hear things that will increase the NSF budget in this area - so one mission is to ensure progress of science, and self-generated issues from the DB community might lead to 5% increases; this year it might be 15% - to the extent you find interesting publishable problems which help people in the geosciences or biosciences there may be double digit increases - from $9M this year.  Take things like making it easier to get data into a database if it is not text - if it is a time series or images or chemical models - people have no idea how to do this and they're using undergraduates; toolkits to make this easy might lead to increased funding.  Second, semantic heterogeneity is important - the Atkins report says some scientists spend 75% of their time moving data from one format to another, and if you helped with that the scientists would love you.  I coordinate our relations with homeland security, but TIA is not that different from finding all data about SARS or West Nile.

 

Jim Gray - we do information management; databases are a core part.  We don't do face recognition or voice recognition - we don't know data type details - we are good at indexing and storing things and getting them back.  So our top level comment is we do information management.  There are 3,000 people at Redmond trying to integrate things on the desktop - someday if you buy some Microsoft software, Avi, you may get something like what you want.  3 people in my group are doing a poor man's version of this.  Copying everything you do into a database.  No file folder hierarchy - you can pivot on anything - big fans of annotation, but computers do the annotation - automatic face recognition and OCR, for example.  Our task is to store the annotations.  The MyLifeBits project is incredibly good for press attention - Gordon Bell has put his life into this and as he gets older it's a big help for his memory.  Work on the world-wide telescope - put all astronomy data online - let people ask questions; this is the first distributed database I've really found - web databases are great - they ask a query and back comes a heterogeneous collection of records.

 

Jennifer Widom - an extremely specific problem which will force us to think about some interesting things - keyword search over XML, and have it give the right answer. (she was willing to stop there)  The IR-DB thing.  A challenge to deal with semistructured data - deal with metadata - it all comes out if you do IR-like search over XML, which has data and metadata mixed together and we don't know what the data really mean - if you go to XML and give keywords this also has to do with ranking, probabilistic reasoning, and it all comes together. (Schek: some people in Europe do this)
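One common formalization of Widom's challenge - return the smallest XML element whose subtree contains all the query keywords - can be sketched in a few lines.  The bibliography document below is made up, and real systems add ranking on top of this:

```python
import xml.etree.ElementTree as ET

def keyword_search(xml_text, keywords):
    """Return the smallest element whose subtree text contains all keywords."""
    root = ET.fromstring(xml_text)
    kws = {k.lower() for k in keywords}
    best = None
    for el in root.iter():
        words = set(" ".join(el.itertext()).lower().split())
        if kws <= words:
            size = len(list(el.iter()))   # smaller subtree = tighter answer
            if best is None or size < best[0]:
                best = (size, el)
    return best[1] if best else None

doc = """<bib>
  <book><title>Stream Processing</title><author>Widom</author></book>
  <book><title>Query Optimization</title><author>Ullman</author></book>
</bib>"""
keyword_search(doc, ["widom", "stream"]).tag  # 'book'
```

Note the sketch returns the `<book>` element rather than the whole `<bib>` - the "right answer" question Widom raises is exactly which enclosing element a keyword match should return.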

 

Stefano Ceri - most web systems publish some information which comes from databases and publish some dynamic data.  I work on principles which let you build some of these systems - I envisage a future where we could teach how to build web applications.  Up to our community.

 

Stan Zdonik - stream processing systems - the addition of quality of service makes them different and unique.  By having quality of service it becomes more difficult.  Mike Franklin and I worked on this for a long time - used profiles - tried to understand user needs and application needs.  If we want to move to autonomic systems we need to understand how workloads categorize different kinds of information, and the QoS part of stream processing helps there.

 

Dave Maier - stream stuff is 1/3 of my time, 1/3 is putting your own structure on information, e.g. re-using attention, applications to personal data management, 1/3 is looking at data product management - from scientific data - observed and simulated data, and the fourth 1/3 is distributed query processing, data theory hybrids, trade latency for need to maintain distributed state, have to have some way to talk about catalog and coverage information, not just what kinds of data sources have but what coverage.

 

Dieter Gawlick - stream processing; started with conventional databases, 90% of our work is done.  We have sophisticated use of information distribution so people can get a response when they are not around.  We did some interesting stuff - expressionless data? - Demand analysis: you have something, who wants to have it.  Oracle streams.  All in products.  As we go forward we have some ideas - I'm looking for business - as I do updates in a DB I get a stream.  We don't start from events; we think of everything as a history - the sweet spot for a query is when something comes into the history - just what is on the brink of history.  Doing this demand analysis leads to a different model.  I subscribe to a publication and it appears.

 

Gerhard Weikum - working on the French woman query and others from this morning - working with DB technology plus machine learning; ontologies are also an asset we would like to exploit.

 

Martin Kersten - battle between multimedia DB and database kernel guys; the media guys have thousands of hours of media and they run their jobs for hours - they need array based functionality.  the database kernel people are battling the hardware - we can't get to the data fast enough the CPU is idle 90% of the time so what can we throw out - we toss out hashing, random access, go to streamed processing - a new generation of DB kernels will be 10X faster.

 

Abiteboul - web is a large knowledge base - worked on that - but the web is not yet XML - We should contribute to turning the Web into a large knowledge base. Precise questions get precise answers and not list of documents -- second problem is mixing XML docs with web services - exchange information that is static or dynamic web services – super-excited bringing XML with embedded service calls, active data bases, query processing, tree structured data, and something very important is recursive query processing.  to give you a flavor when you use Kazaa to find records you are doing recursive query processing on the web - not efficient - beautiful technology, now we have a use for it.

 

Timos Sellis - formalize ETL by picking the right operators - if you have documents in thematic hierarchies, where does this doc sit?  Where you find it, ranking should take account of this - ontology information, semantics.  The technical part of the work is managing catalogs - building new thematic directories - extend search to use structural properties on top of keywords.

 

Dave DeWitt - I used to do optimization for Xquery but I think that's impossible - at the end of my query I'm interested in a few hard core issues - what do we do with terabyte disk drives, which make parallelism hard and whose bandwidth per drive is poor?  We have to treat them as tape drives.  Optimization: queries are too complex, statistics are not good enough - look at what is in the buffer pool, what pages or tuples; push adaptive optimization.  I’m hard core DB - the web stuff is interesting but not for me.

Lesk (me) - the Vannevar Bush dream, and Jim Gray's everything online - we have the text but we need help on the numbers - ordinary people should be able to add info and get it back.

 

Hector Garcia-Molina - you have lots of systems interacting but they are autonomous - why will they cooperate - what is the incentive to share or cooperate, even forward messages let alone queries or results.  have to think about how systems get incentivized and why we should trust them. lots of interesting problems there.

 

Laura Haas - I don't really do research any more - very interested in large scale systems - integrate information from diverse systems - range from federated to big warehouses using ETL - space with caching and replication in the middle - interesting physical DB design issue - also crosses organizational boundaries - data placement problems - also need systems to be more dynamic - now they are statically configured and an expert sets them up - want to automatically map to different data sources - the whole grid thing - service interfaces - potential for us - how we can use the kinds of services that are provided by the grid-op-sys people - as they provide security services or accounting/billing our systems should open up and do things like that.

 

Mike Stonebraker - my grand vision in 1974 was System R.  Over the last 25 years we have added (a) code in databases - Postgres added code, (b) spatial data, arrays and new data types - another add-on, (c) active databases - glued-on triggers, they are second class citizens, (d) text - in object relational systems we put on text but there is no ranking or probabilistic stuff, (e) queue management - added but not enough for streams, (f) parallel databases, (g) distributed databases, especially web ones and heterogeneous, and everyone is putting an O?DP wrapper around what's there.  We took a 1974 idea and glued on all this crap - we should start with a clean sheet of paper; then we would not get architectures that say DB2 has the data and BEA WebLogic has the code.  That's a dumb distribution in a lot of ways.  We have a gazillion of these things - they don't work together - rethink from a blank sheet - new query languages, interfaces, architecture.  Sad that Ted Codd just died.  A new mandate is needed to do this better.

 

Naughton - meta-level comment - if you do what is helpful for scientists it is of enormous benefit for non-scientists; look at the web or computers.  It will happen again.  Every interesting technical challenge exists in the scientific arena - personal DB, data streams, web databases - all in more manageable ways.  And we can experiment on scientists more easily than on corporations.  And funding agencies will support this.  If we keep looking for motivation in problems that have already perceived commercial value, industry will be doing it already.

 

Yannis Ioannidis - 2 things: (a) if you want to buy a used car you can go on the web and bid; I see a need to buy data products - the value of a product may be a dollar price, or in conjunction with how fast you get it or how complete the data is or how reliable - I want e-commerce techniques for query optimization - it's not just buying one thing; pieces of a query may run in different places; Mariposa+++ - multidimensional optimization using e-commerce techniques, lots of theoretical issues also. (b) personalization in databases - I should not get the same answer to a query as Ullman gets - much personalization on the Web but not in DB.

------------------------------------------------------------------------

[I think I heard a lot about streaming, about scientific data, about shared out architectures, and scattered other topics.  few people are doing what they did for their PhD]

 

DeWitt - nobody said they were working on Xquery (Franklin and Carey are doing some).  Do you guys believe lots of stuff is going to be done?

Stonebraker - nobody mentioned concurrency control or access methods.  We are all out on the periphery.  Bernstein: there are people doing this - work in multiversion access methods.

Schek - there is some work on transactions in a more general sense.

Bernstein - dozen PhDs working on XML query optimization just not in research

Widom - people are doing optimization under the label "Xpath queries under streaming XML". 

Franklin demurs.

Hellerstein - few people needed storage.
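The "Xpath queries under streaming XML" label Widom uses means evaluating path expressions over parse events without materializing the document.  A minimal sketch for simple child-axis paths over SAX events - the example document is invented, and descendant axes, predicates, and shared evaluation of many simultaneous queries are the hard parts this omits:

```python
import xml.sax
from io import StringIO

class PathMatcher(xml.sax.ContentHandler):
    """Match a simple child-axis XPath (e.g. /catalog/book/title)
    against SAX events, never building the document in memory."""
    def __init__(self, path):
        super().__init__()
        self.target = path.strip("/").split("/")
        self.stack = []        # path of currently open elements
        self.buffer = None     # collects text inside a matching element
        self.results = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.target:
            self.buffer = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if self.buffer is not None and self.stack == self.target:
            self.results.append("".join(self.buffer))
            self.buffer = None
        self.stack.pop()

def stream_xpath(xml_text, path):
    handler = PathMatcher(path)
    xml.sax.parse(StringIO(xml_text), handler)
    return handler.results

doc = "<catalog><book><title>A</title></book><book><title>B</title></book></catalog>"
stream_xpath(doc, "/catalog/book/title")  # ["A", "B"]
```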

Kickoff Monday Morning

Stonebraker -

a)       Vote on the organizers' view of what the important problems are - you each get 3 votes on the topics from yesterday's session.

b)       Tonight's session will be different - 1 hr on personal DB led by Gray & Weikum; 1 hr on "vision" - why did the last 4 reports have little influence - so we need this as a vision thing - just a brainstorming session; want to end early today.  Everybody is encouraged to stop when the discussion seems to have petered out, not just use all the time.  The chairs will also do that.

Avi - who is audience for the report?  Stonebraker - the audience is researchers picking research directions and also funding agencies.  If we don't write something $1B goes to supercomputer centers and not us.

Hans - what did we present 15 years ago?  Is that still in the vision?

DeWitt will do the diff.

Bruce Croft - IR & structured data.

What is IR?  

  70s-80s research focused on document retrieval

  90s TREC reinforced the IR==document retrieval view

first bib records, then full text as time went on.

now doc retrieval is important - turned into web search.  Other topics:

                                question answering - finding short segments with particular info

                                cross lingual retrieval

                                distributed retrieval - now big

                                 topic detection and tracking

                                multimedia retrieval - images, video, annotating them - starting up now,

                                learning and labeling images and video with text.

                                summarization

IR & databases

                differentiated by unstructured/structured data

                what about marked up text and semi-structured data?

                 text has tended always to have at least a few fields

                recent database papers on nearest neighbor and similarity search

                distributed peer to peer search

                Web search

                info extraction

                text data mining

boundaries getting fuzzier

IR integrated with databases

                many such proposals - now in XML context - go back to the 70s

                                e.g. combine ranked search and the specificity of user queries

                supporting a probabilistic framework is the key

integration vs. cooperation: do we really want one giant system?  or should

                we still have separate systems & separate capabilities but they work together

semantic web - "if you made the web a database" - this is making the web into a knowledge base and that won't happen - we've had a debate for decades about manual vs. automatic representations of what documents "mean", and the two together work better than either one alone, but creating the manual versions is very hard.  That's the lesson from the IR work

                go for knowledge or statistics?

Stonebraker - every Wall St house wants to merge the news feed and the stock prices - why is it so hard to identify companies whose price has changed and which are mentioned in the news ticker?

Gray - why haven't you mentioned KDD (Knowledge discovery)?  The field is very fragmented.  Every product has a text retrieval bolt-on to their database.

Croft - anyone who talks about text data mining is close to IR - that works well together.  Data mining in structured data - numbers -

Hellerstein - it's all based on clustering, etc.  Same as machine learning - there is a common set of technologies.

Croft - IR people like NL - want to understand how to describe and satisfy an information need in an unstructured world.  That gets us excited.  Yes, we built inverted file technology for large data but we focused on NL and the DB people have different needs.

Stonebraker - If I ask Google "what is the temperature in Lowell" I get a terrible answer. Why can't it invoke weather.com and fill in the Lowell zip code?

Croft  - We are working on that in the question answering world.  You do want some context - you want more than just "73" as an answer (did it come from Bob's home page or where?)  DB retrieval is fact retrieval so there is overlap.  Some people work on extracting tables from text.

Stonebraker - this is similar to the first time I heard the discussion 20 years ago.  The communities should cooperate and they don't.

Hellerstein - Not true!  There has been a lot of overlap, now forced by the Web - the database community feels weak on text - and then we found that the IR stuff isn't that hard.  Cohera and Whizbang are companies that had combined products.  This is a healthy area.

Mike Franklin - How many people have been to SIGIR conferences?  (few)

Mike Carey (?) - We are organized into stacks.  We should have a conference on a problem - not by community.

Phil Bernstein - This is always true.  There are always problems in slightly different areas and they always give an opportunity for taking techniques across.  Similarly DL is slightly different from IR.

DeWitt- We could organize a conference on a topic.  I like that idea.

(Martin Kersten?) - We don't need anything new - just join to attack a problem.

Croft - We do need something different than "you come to our conferences and vice versa". 

Timos Sellis - A few more applications?

Croft - Want to have an NL query and not think about it. 

Ullman - Re semantic web - you talk about semantics but when you have to do something you do syntax.  If you take the temperature in Lowell thing you ought to be able just to say "temperature Lowell" - How much more is there to do? Crawlers are bad at this because it is timely.  History in Lowell would work better on Google.  I’m curious as to what you think is the advantage of focusing on deep understanding rather than giving people tools to use?

Croft - This history so far is that focus on deep understanding and semantics has not produced benefits in effectiveness.  Learning patterns has been useful - applied probabilistically - the little words don't help.

Gray - People use mostly nouns and verbs - they can throw away the rest – a telegraphic interface.

Ullman - Temperature is special because you can't crawl the web and get temperature in Lowell today.  Google will decide someday to carefully look for weather the way it can look for maps.

Gerhard Weikum - NL is something that is not that great for queries - you need to understand the text that is there.

Ullman - Google works because it is simple.

Pazzani - The Google answer to "what is the temperature in Lowell" is a web site about Uranus (well actually he says he misspelled temperature – the actual result of the search is indeed the weather underground site)

Serge Abiteboul - The problem is that if you have some info you can put it into plaintext and that's ridiculous.  You have meta information and the question is when you have information if you start publishing meta-info it makes it much easier to avoid NL understanding.

Hellerstein - You can make schemas and just make things harder to use.

Abiteboul - Disagree

Bernstein - It's not just how you say things but how you learn - it must not be a manual activity to attach metadata.

Gerhard Weikum - takes over leading observations on DB, IR

business data is boring

                action is e-science, e-culture, and entertainment

absolute facts are a myth created by accountants

                uncertainty is fact, ambiguity is fact

hope for precise semantics based on universally agreed upon

                ontologies and perfect metadata

IR:

similarity search with ranking is the best approximation to semantic search

DB:

                can still leverage context (metadata, ontologies, multivariate distributions)

agree with Ullman - no such thing as pure semantics.
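The "similarity search with ranking" on Weikum's slide is, in classic IR terms, TF-IDF cosine ranking; a self-contained sketch over toy documents:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a TF-IDF weight vector per document."""
    toks = [d.lower().split() for d in docs]
    df = Counter()                      # document frequency per term
    for t in toks:
        df.update(set(t))
    N = len(docs)
    def vec(words):
        tf = Counter(words)
        # weight = term frequency * inverse document frequency
        return {w: c * math.log(N / df[w]) for w, c in tf.items() if df[w]}
    return [vec(t) for t in toks], vec

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs):
    """Return document indices, best match first."""
    vecs, vec = tfidf_vectors(docs)
    q = vec(query.lower().split())
    return sorted(range(len(docs)), key=lambda i: -cosine(q, vecs[i]))

docs = ["database query optimization",
        "gene expression data",
        "query processing for sensor data"]
rank("query optimization", docs)  # [0, 2, 1] - doc 0 is the best match
```

Results come back with graded scores rather than exact matches - which is precisely the "result certainty < 1" contrast with traditional DB queries drawn later in the session.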

 

killer queries - where Google and DBMSs fail

Find gene expression data and regulatory paths related to Barrett tissue in the esophagus.

what are the most important results in percolation theory?

Are there any theorems isomorphic to my new conjecture?

Find information about public subsidies for plumbers

Where can I download an open source implementation of the ARIES recovery algorithm? (needs to be decomposed into several pieces).

Which professors from D are teaching DBS and have research projects on XML

Who was president of the US when George Bush was born?

                  (can't do the decomposition and linking again)

"Who was the French woman that I met at the PC meeting where Peter Gray was PC chair?"

                                a) go through email archives and find which program committees I was on

                                b) then look to find the chairs of those committees

                                c) then having found that this was VLDB 95, get the list of the members and see that Sophie... came from Inria, Paris.

                                d) know that Paris is Paris, France.

Garcia-Molina - you are working in AI.

Weikum: Looks AI complete but you can do this with dumb things
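Weikum's "dumb things" are steps (a)-(d) chained as ordinary lookups; a toy sketch over invented in-memory sources (all names and tables below are hypothetical):

```python
# Toy in-memory "sources"; all names and tables are invented.
email_archive = [  # (committee, role, person)
    ("VLDB 95", "member", "me"),
    ("VLDB 95", "chair", "Peter Gray"),
    ("VLDB 95", "member", "Sophie X"),
    ("ICDE 96", "chair", "Someone Else"),
]
affiliations = {"Sophie X": ("Inria", "Paris")}
city_country = {"Paris": "France"}

def find_person(chair, nationality):
    # (a) committees I served on, (b) filtered to those the given person chaired
    my_pcs = {c for c, r, p in email_archive if p == "me" and r == "member"}
    chaired = {c for c, r, p in email_archive if r == "chair" and p == chair}
    for pc in my_pcs & chaired:            # (c) e.g. VLDB 95
        for c, r, p in email_archive:
            if c == pc and p not in ("me", chair):
                inst, city = affiliations.get(p, (None, None))
                if city_country.get(city) == nationality:   # (d) Paris -> France
                    yield p

list(find_person("Peter Gray", "France"))  # ["Sophie X"]
```

Each step is a trivial join; the research problem is discovering the decomposition and the sources automatically, which is hard-coded here.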

Croft - Finding isomorphic theorems is the hardest one

Weikum - There is an "open math" project.

Croft - For question answering TREC actually does fairly well on that.  There are a lot of factoid questions and the current systems are finding 70% of the right answers in the top one or two.  But these are not factoid questions.  ARDA is now sponsoring AQUAINT, which looks at things like this in the intelligence domain.  They want to find authoritative docs.

Weikum - People expect to type a few words at Google and get the answer - the goal should be to minimize human time - you learn how to rephrase the query.

Agrawal - Some queries will get money: "what websites accept Visa/Mastercard but not Amex" - Amex will pay for that; but for many queries people won't pay much for an answer - we need to understand which queries have to be cheap and which can be expensive.

Weikum - Not sure money invested in the right things. 

Timos - What is missing from DB ?

Weikum -  Knowing which database to look in for the "gene expression data related to Barrett tissue" - there are many gene databases on the web – and each has its own schema.

Halevy - How much is understanding the query, and how much is mapping it to formal SQL?

Silberschatz – Some I can see how to map into a DB; the one about math I can't.

Weikum - There is the open math activity - suppose you have high school math text books and we have codified them into logic. Some inferencing capabilities in that - you can then mimic this.  Pattern matching on XML.

Croft  - What are the drivers for integrating IR & DBMS?  You could build special purpose systems for each of your examples - or you could try to do this as IR.  But where do you have to unify the systems or make them communicate?

Mike Franklin - That’s the key question

Stonebraker - If you want metadata - e.g. super-duper UDDI - that's what we bring to the table.

Weikum: Shouldn't we formulate this as a meta-query - not SQL.

Halevy-The fundamental problem mixing the two worlds is that we have a subquery in some formal world and we go to a repository and all we have is text.  How do we come back with an answer to do joins?

Weikum - You could XML all the data you see on the web; but not sure which tags are important. Asked students to do researcher home pages and grossly underestimated the difficulty.  And it still doesn't handle ambiguity

IR strengths

  methodologically rich - statistics, probability, logic, NLP

  appreciation and experience with machine learning

  awareness of cognitive models for end-user intention and behavior

DB strengths

  integrity, scalability, availability, manageability

  system engineering

  resource optimization - caching, memory mgt, query opt, physical design,

  scheduling

Mike Franklin - Databases allow manipulation - update, summarize, aggregate.  This is more than IR does.  What about "find average salary of a US CEO", which will require computation?  Breaking this into the two queries is hard.  Argument about whether it is easy to figure out that you need an average.  The IR people can find a table but not do an average; the DB people are the reverse.

Croft: IR and Google are not synonymous.

Stonebraker – It’s easy to express your query once you have a table; the hard part is putting together the table.

Hellerstein - You don't even know what the breakdown should be.  It is harder than you suggest.

Maier:   Human attention is scarce resource.  Where do you apply it? Writing metadata?  Google harnesses this a little bit.
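Franklin's "average salary of a US CEO" splits into an IR step (retrieve relevant snippets) and a DB step (extract and aggregate numbers).  A toy sketch of the aggregation half over invented snippets - the hard part he points at, deciding that an average is needed at all, is exactly what this omits:

```python
import re

snippets = [  # pretend these came back from a keyword search (invented data)
    "Acme Corp CEO salary: $2,400,000 in 2002",
    "The CEO of Widgets Inc earned $1,600,000 last year",
]

def average_salary(snippets):
    """The 'DB half': extract dollar amounts and aggregate them."""
    amounts = [int(m.replace(",", ""))
               for s in snippets
               for m in re.findall(r"\$([\d,]+)", s)]
    return sum(amounts) / len(amounts) if amounts else None

average_salary(snippets)  # 2000000.0
```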

Weikum - DB & IR: issues & non-issues

issues

  exploit collective human input

  use ML & ontologies

  flexible ranking to XQuery

  use ML to convert Web to XML

  extend Google to deep web

  break Google monopoly

  acquire broader skills

non-issues - we can do these

  crawl structured data

  simple IR on XML

  polish XQuery and implement efficiently

  homepage.xml schema

Again we need probabilities.   As a special case we do traditional DB with result certainty 1.

Google is popular because of ranking and coverage  

Ullman: No,  they were popular when they had less coverage.

Weikum - Afraid of Google having a monopoly.  Want to have a peering system that spreads out queries.

Mike Franklin - Purest merger of DB & IR is in annotated scientific databases and this problem is important today.  You need both DB & IR.

My session on infoglut - rather flat

 

Mike Franklin.   Info shadow is a problem.   I look for Canyon Creek development near me and it is buried under a lot of stuff about the same name in Texas

Gray - We need spatial search and also time – this pushes to a schematized metadata search – not just flat text.

Lesk - also proper names

Bernstein - Yahoo does categories.

Pazzani - Google had a student contest for new feature and the winner was geographic search.

Lesk - we need to think about video

Hellerstein - Sensors and sensor fusion generate lots of info there that is somewhat  structured.

Snodgrass - We need results on impossibility.  Which IR tasks aren't worth trying.

Lesk:  IR doesn't do much of that.

Croft - We try to categorize queries.  One TREC category is web search.  We are learning about queries and what we can do, which will work.

Gray - DB wants schematized search. IR doesn't.  Syntax has lost to statistical search in IR. Is  there a place where syntax works?

Lesk - OPACs - but Amazon seems to do better.

Gray - We should make the Web available as a study item for linguists, sociologists, etc.

Garcia-Molina – Our group at Stanford is doing that.    

Stonebraker - I have a 6th grade daughter - her teachers ask her to look up dinosaurs.  It is hard to find things appropriate - journal articles worthless.

Lesk – Search should do Flesch score and picture/text ratio.
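The Flesch reading-ease score Lesk mentions is a simple formula over words, sentences, and syllables; a sketch with a rough vowel-group syllable counter (not a dictionary-based one):

```python
import re

def syllables(word):
    """Rough heuristic: count vowel groups, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syls = sum(syllables(w) for w in words)
    n = len(words)
    # Standard Flesch reading-ease: higher score = easier to read
    return 206.835 - 1.015 * (n / sentences) - 84.6 * (syls / n)

flesch_reading_ease("Dinosaurs were big. Some ate plants.")
```

A search engine could use such a score, plus Lesk's picture/text ratio, to filter results for a sixth grader.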

Gray - Spam - learn what is not interesting and what is interesting in user context.  Profile people

Croft - Contextual IR is an active area.

Move to what DB flavor in e-sciences

Hans - You said you wanted convergence - should ask about mergers

Bernstein - We focus on the query; the IR stuff is on preprocessing - organizing.  He says MeSH/UMLS got them from 50% to 80% performance - likes thesauri.

Lesk - Disagree schema helps very little. Described how the Internet Archive works. What could databases add?

Gray - It would run a lot faster – it is unusable now.

Dave Maier - Tobacco docs - comparing with open lit - query by a smoke chemist is "what  in these docs contradicts the journal literature?"  It is hard to do that.

Croft - That's info extraction

Lesk - see Futrelle doing that.

Weikum - What is the purpose of the query e.g. insider trading.   That's text and numbers.  They organize ahead of time. In your case you didn't know in advance how the documents would be used (about smoke chemistry.)

Hans - What services do the sides provide?  IR can do text categorization, DB can do engineering

Final vote show of hands: 3-1 for "I couldn't find it" over "it wasn't online"

Serge Abiteboul - XML  (prepared with Jennifer Widom)

  a boring research topic?

  a new frontier?

  a means to keep standards people busy?

XML

 rapidly adopted by industry

                format for exchange of small/medium pieces of data; when archived, it grows to large volumes

                a data model - for a wide range of kinds of data.  not relational - permissive typing, full-text search

the database community should be involved and perhaps concerned

XML issues

                storage of XML

                native vs. XML-relational

                lesson from OODB - this is a business issue, but the vendors are not trying to block it

                efficient representation and compression

                key issue is interface - not clear whether it should be like a DB (DOM, SAX) or a query language - needs work

                revisiting old topics

                database design

                integrity constraints

                 concurrency control

                access control
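The interface question above (navigational DOM/SAX access vs. a declarative query language) can be made concrete with a small sketch in Python's standard library; the document and element names are invented:

```python
import xml.etree.ElementTree as ET

# Hypothetical news-feed fragment.
doc = """
<articles>
  <article lang="en"><title>XML adopted</title></article>
  <article lang="fr"><title>XML adopte</title></article>
</articles>
"""

root = ET.fromstring(doc)

# Navigational, DOM-style access: walk the tree by position.
first_title = root[0].find("title").text

# Declarative, query-language-style access: a path expression with a predicate.
english_titles = [e.text for e in root.findall("./article[@lang='en']/title")]
```

The first style binds the application to the document's physical shape; the second leaves room for an optimizer, which is the database community's usual argument.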

reinventing the world

universal query language for XML

                problems with Xquery - promoted by W3C

                focus is on complex queries; we need simple filters, IR-style search

                too complex, ambitious, too much politics

                can you really go from documents to data

                people want to do what they did in SQL and others want doc search - this is hard

                can we undermine Xquery with something better?

                                thinks we need small core OQL plus plug ins

                                running late - we need standard now

This direction deactivated by XQuery

                scientific: is Xquery good or bad from a scientific viewpoint

                politics: should we push for it

Weikum: SQL can be segmented.

Stonebraker - You can't talk about Xquery without talking about schema.  That is what has to be subset.  Big tension between what the doc guys would like and XML.  XQuery does everything.  It makes IMS look simple.

Gray - We had identified Google with IR and now we are identifying Xquery and Xschema with everything.  We should think from a blank sheet of paper.  OODB did not fail;  it made object-relational possible.  An approach here is that the train has left the station.  We can't do much - there is an alternate path which is a much simpler query language.  You should pursue that if you have a better idea.

Query optimization

                for subsets of the language

                tree structure is a new ball game - new index structures, cost models, etc.

                 depends on storage

                revisit distributed query processing and view maintenance

everything being studied

 

Foundations

                lots of work on semi-structured data

                first-order logic and relational languages: strong

                OQL/functional languages: reasonable

                full-text search: messy

 typing

                 much more complex than in relational world

                 not settled

                query type checking, type inferencing, update consistency

                very active area - people from DB theory, functional programming, etc.

all this again is active, but problems not simple, need more work.

real frontier: world is changing

old vs. new data management

 

Old                     New

closed world       openness

client/server      P2P

distributed db     web-scale data

query/answer       subscription queries, stream queries

active db          active databases + web services, service discovery

QBE interface      new interfaces

 

research must focus on new issues - not single site data

beyond XML: the semantic web.  e.g. putting music on the internet was a very nice problem and the solution was elegant (Kazaa) even though the lawyers disagree - it uses little traditional technology

 

Widom - When did you add semantic web?  I'm not responsible for that.

Abiteboul - All this is syntax.  Makes Ullman happy; the most fundamental difference from relational DB to web is that you don't know the semantics. 

Ullman:  A high order bit for the report is "is querying XML too important to be left to W3C".

Stonebraker:  A simple thing to say is that Xquery is a pile of crap and XML schemas are a pile of crap and we can't influence that. If we had a clean sheet of paper and wanted to do something right, we would focus on merging doc world and structured data. No standards body can do this.

Widom:  People are implementing this it's too late.

Lesk- So what is an XML success story?

Abiteboul - Newspaper articles - All were in separate formats.  Now they all use XML,  particularly NewsML.  Now we can merge 5 newspapers.  You have parsers and editors  and you can publish with very little effort.

Maier - The tools are very important.  I studied data interchange formats and found that people agree on what things mean, but without tools they weren't used.  Some things were left behind, like array data.

Gray:  Another plug for code+data; HTML started and people wanted to send script and when you send me XML I don't know what it is, just a bunch of tags, you have to send me the methods as well.

Abiteboul - Before methods you need metadata; then you provide code.  We should be more active - things like UDDI are dirty.  We should be helping here.

Gray - Dave Clark has a nice model for standards.  There is a period when it's too early for standards and a period when it's too late - research and production phases.  You need to be in between.  I do not think we are at the standards phase with our ideas yet. We still need more prototypes.

Abiteboul - We're working on Active XML - XML with embedded calls.

Stonebraker - You said we have to worry about views and updates – everything that came along with the relational model.  It will be more complex – this is what collapsed IMS.

Widom - You can write lots of papers. 

Stonebraker You're too optimistic

Abiteboul We have a lot of models.  In a distributed session you probably will do some integration of things that are very relational -- integrating at the tuple level.

Stonebraker - Part of the IMS difficulties were restrictions on views.

Abiteboul  - OODB also had trees.

Ceri - If there's a lot of XML data out there we don't have the luxury of not dealing with it.  Even if hierarchical would be the wrong way to start from scratch, we can't ignore it.

Gray - The IMS data model was designed by blue-collar programmers, no theory. Don't postulate that there is no good hierarchical data model because IMS failed 30 years ago.  Nobody has ever tried properly.

Bernstein - We can count on incremental forward progress.  All the relational products are making big investments in XML.  The data capture is inherently semi-structured.  e.g. there is always a "comment" field.

Widom It would be absurd not to bless the area

Bernstein - But people do think it is boring, the same areas as ten years ago.

Hellerstein - We should focus on more IR things with XML and here is a list of plausible real problems (the "new" in Abiteboul 's last slide).

Widom - We spent all morning moaning about structured data and IR.  This is a chance to do something about it.  The next language should be more IR-ish what went wrong with Xquery?

Alon - Too many politics.

Lesk - Look at Dublin Core - every time they meet the standard gets thicker and heavier weight.

Maier - The manuals are too thick -  but SQL is no better.

Kersten -   Query sessions are missing from this discussion - not just one query in isolation.

Iannis - Database people know queries.  Actual users explore in unstructured ways and this often finds the most interesting things.  Queries are important;  but other things are too.  Context, personalized stuff, other modes of interaction.

Lesk - ranking, visualization

Hans - processes, flows, combinations of services .

Ceri -  I want similarity based browsing.

Snodgrass - We don't know if algebra is better than calculus.

Iannis – It is not an issue of calculus vs. algebra.  Declarative vs. procedural is more important.  I did a study: for simple stuff declarative is fine; for more complex stuff procedural is needed.  I don't know what kind of interface to give people.  But none of this has to do with XML.

Franklin: Semi-structured data is the big issue.

Maier – Why is there no XML on the web?   Are we doing anything to help with XML that is streaming?

Abiteboul  - Two questions a) not much public XML but lots in industry b) how do you handle changing data?

Hellerstein - If you take queries over streams and add distributed databases you get routing which is a big topic in the networking area.

Pazzani - In a startup XML is being used as an interchange language and then it gets dumped into relational DBs.  Also used as an intermediary for different screens, etc.  Not much going on in XML data bases.

Bernstein - Quite a lot going on.  Talk to vendors; our product people can list many big-time customers with lots of XML data.

DeWitt Is it simple or complex?

Bernstein - They want to do queries. There is a wide range of tasks. We can't move fast enough.

Widom There is no relational on the web either.  We don't ignore RDBMS.

Bernstein - Research on XML as a data model also has room for innovation.  Don't be negative about lines of traditional database research that can be applied to XML

Widom - The conferences are 1/3 XML now.  It is not a problem that there is not enough work.

Stonebraker - If you do research that competes with the vendors, that's not research.  A big problem we have is that a lot of what we do is too close-in.  Vendors will do this.  We should do something Oracle is not doing.

Widom – For example, query optimization for XML is not for researchers

Stonebraker - Yes.  Don't do that.  Leapfrog to the next data model.  XML stinks; it is too complex.

Widom - XML and XML Schema are different; the schemas are too complex.

Hellerstein - Our CS colleagues won't fund us to work on XML query optimization, but many other things would sound better.

Agrawal – This is not a firm statement but anecdotal info is that XML being stored right now is very simple.  A relational tuple or other simple structure.  The complexity of schemas that are coming is justified.

Widom – For example, an airline record has a few structured fields and the comment field;  that does not need all of XML.

DeWitt - We should take a stand.   We're going to get blamed for Xquery and Xschema.  People will say it came from the DB community.  Ullman said we should repudiate any association with Xquery.

Widom - We can't do that we are already associated with it.

Croft - As an outsider reading DB papers I do blame you for Xquery.

Stonebraker - This is easy; we can say it is commercially important but we can do better.

Maier - What should we do as a data model if our goals are openness, peering, and so on?

Lesk:  Whatever you do put <> around it and call it xml.

Widom - Nobody has a beef with just XML

Abiteboul - XML is just simple markup; then the schemas came and made the problem.

Widom - Why did everything get so complex?

Mike Carey - what is our purpose as a community? 

  1 - produce great new ideas: ie write off Xschema and forget it

  2 - structure the field (credits to Jim and Phil)

  3 - educate the workforce - Wisconsin produced students with experience building industrial-strength software; the claim is Paradise was better than DB2

Gray - Some of us - Dieter & me - work at companies with hundreds of PhDs who are doing the "how to make XML work" part.  The community is working in this area, but where should the research work, not advanced development, go?

Carey - If we focused entirely on research, many of the Wisconsin things would not have happened.  

DeWitt - Should we focus on Xquery optimization so you're educating the work force for the current jobs?

Gray - the academic community completely ignored SQL.  They said it was brain dead.  That was fine, it happened anyway.  I think we are in a similar state re XML- XSD-XQUERY today.

 

Stonebraker calls time.  ten minutes to lunch.

results of the poll on the gong show

    federated, heterogeneous              13

    querying the internet                      10

    personal db                                      8

    open source                                      5

    privacy                                              5

    visualization/new interfaces         5

    probabilistic                                    5

    autonomic                                         5

    db tools/cybertools                         4

    experiment management                4

So how to adjust the schedule:   add querying the internet?

Hellerstein – No, we've done that.

Bernstein - We discussed visualization in 1989 - it never goes away.

Dave Maier –I would do experiment management.

Agreed to add that.

Stonebraker will do visualization, interfaces; frustrated that no one in this room is working on better UIs.

[Aside: Abiteboul is working with BnF on archiving the web; they are changing the law to get legal deposit on French (country) websites.]

Haas - Top 10 Reasons why Federated Can't Succeed and why it will anyway

Carey

Brief history of federation

  Multibase @1980.

  many attempts since - every few years with new model  

                functional, relational, object-oriented, logic-based, XML

  still not solved.  last night we all brought it up again

  will we ever solve it?

Haas

 top ten reasons against federation - I get whines about all of them

10. Robustness: systems fail, sources become unavailable, and more pieces mean more failures, so robustness suffers.  (Objections: DeWitt - Google; Hellerstein - peer-to-peer; Stonebraker - your company is selling "sysplexes", which are single systems made of things that can fail; one piece of big iron will do better than 500 Linux systems - sort of anti-federating.)

9. Security: different systems have different security mechanisms, so it is hard to have a coherent view of permissions; more points of failure make it harder to make guarantees; and data is sometimes the "corporate jewels" and needs to be protected.  Schek - look at e-health: would you trust that to a federation?

8. Updates: recording a change is not always an update; sources may not be databases; you may have to go through an application API to do an update.  ACIDity: not all data sources support ACID properties, so transaction semantics are not always possible; e.g. our current system doesn't support 2-phase commit.

7. Configurability: hard to set up; too many architectures possible, many choices, little guidance.  Lots of code to install and lots of connections to support.

6. Administration: hard to keep up; monitoring is hard; not all sources have tracking facilities; tuning is difficult; repairing is painful; you need distributed debugging and you have to deal with different vendors.

5. Semantic Heterogeneity:      hard to identify commonalities - same terms, different meanings (but this is also a problem in a single system with the same data)

4. Insufficient metadata: all sources have different metadata with no uniform standard

3. Performance (data movement): need to move data, geographic distribution is common and the WAN is slow; large data volumes common and you can't just cache because changes can be frequent and hard to track, plus storage is not unlimited.

2. Performance (complexity): decision-support applications do complex queries, and choices give big differences in performance.  Some sources may not have enough CPU power, and you may need expensive functions of the data.

1. Performance (path length): simple queries - even OLTP-like ones - have huge overheads.  Simple queries are common: easier to write, often automatically produced.  For performance you should use one big query, but that is not how they get written.

Mike Carey – Q: we have had these problems for 20 years, so why will federated succeed?  A: It has to: integration is a top IT issue and is not going away; the alternatives are expensive and/or painful - write it by hand with 10 different APIs.

     EAI/workflow solution consolidation - warehouse, data marts

Maier - How do you know about the data?

Bernstein - You do this in big  meetings.   Also simple scenarios exist - may not need high security or robustness for some applications.  Customers know the data; need is great and compromise is possible.

Progress being made - 20 years of distributed query processing.  Plumbing is in place; connectivity there.  Reliable messaging. XML is now sort of basic agreement on how to exchange data.  XML schema  is a way of describing data.  So we're getting closer.

What would we do if it worked?

   retire?  integrate the web - data google?  p2p database?

Is research warranted?  what are the most important topics?

Bernstein - The piece of this where we're making progress is semantics.

Maier - Look at blame allocation - be able to write down expectations of what the pieces should do and then be able to see what is happening.

Ulman - When you have enormous amounts of data you have to be uniform in your dealings.  You can't write code for every 100 bytes.  Once you have declarative languages you have to use query optimization. 

Stonebraker - Cohera found out that you didn't mention is that  semantic heterogeneity nearly always involves dirty data - and cleaning data is better done in bulk. 

Maier - In health data they want to get something going. Federated is easier and if that doesn't work fast enough they might try to put it in a warehouse.

Haas - We are doing a service integration system based on db2.

Croft - Does federation include resource discovery? Does it include schema?

Haas - Federation includes metadata - I didn't consider resource discovery separately.

Halevy - To feel better about what we have done we need to focus on who our customers are.  If people can put things in a warehouse they will do so.  We need to go after the people who can't put their data in a warehouse.

Mike Franklin - Semantic heterogeneity not so bad.   Security is more serious.  They won't let people into their systems.

Carey - Sometimes all you have is a minimal interface

Stonebraker - You often have a non-relational interface which you have to wrap and then try to federate at a relational level - You might be better off at web level.

Garcia Molina - Why didn't anyone else vote for workflow;  Distributed workflow is similar.

Hellerstein - On the topic of reliability, there is lots of exciting work in networking.  You can find a key's value in a logarithmic number of hops in p2p networks.  The DB community doesn't talk to these people.
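The p2p lookup Hellerstein mentions can be sketched as a toy Chord-style ring (node names and sizes invented): each node keeps "fingers" at power-of-two distances on the identifier circle, so each hop at least halves the remaining distance and a lookup finishes in O(log n) hops.

```python
import hashlib
from bisect import bisect_right

M = 16               # identifier bits
RING = 2 ** M

# Hash 64 hypothetical node names onto the identifier circle.
nodes = sorted({int(hashlib.sha1(f"node{i}".encode()).hexdigest(), 16) % RING
                for i in range(64)})

def successor(ident):
    """First node at or after ident on the circle (the key's owner)."""
    i = bisect_right(nodes, ident)
    return nodes[i % len(nodes)]

def finger(node, k):
    """The node's k-th finger: successor of node + 2**k."""
    return successor((node + 2 ** k) % RING)

def lookup(start, key):
    """Greedy finger routing; returns the hop count to the key's owner."""
    target = successor(key)
    node, hops = start, 0
    while node != target:
        dist = (target - node) % RING
        # Take the largest finger jump that does not pass the target.
        jumps = []
        for k in range(M):
            f = finger(node, k)
            if 0 < (f - node) % RING <= dist:
                jumps.append(f)
        node = max(jumps, key=lambda j: (j - node) % RING)
        hops += 1
    return hops

hops = lookup(nodes[0], 12345)
```

With 64 nodes a lookup typically takes about log2(64) = 6 hops, and never more than the identifier width.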

DeWitt - Distributed hash tables are not going to solve the world's problems.

Hans - You have underemphasized the problems of security and reliability. We can't live with low standards of accuracy - again see electronic patient record.

DeWitt - So what is the message?  Laura says it's impossible and Stonebraker says it's done.

Lesk – The intelligence community tells me you only get a keyhole into db - they refuse to federate.

Agrawal - They want "need to know" information sharing - minimal information to be delivered.  We have a paper coming out.

Stonebraker - Two great success stories & one great failure.  (1) Airlines have been federating for years - very successfully.  When you have only half a dozen elephants and a huge incentive it works. (2) Both Dell & Wal-Mart have federated their supply chains.   One big enough elephant.  (3) RosettaNet - electronics community trying to federate their supply chain.  No big enough elephant and so it is not working. There is the same problem in autos.

Laura - It will work in specialized cases.  We should solve some of these problems.

Hellerstein – Tools are good. We won't solve all of these - we need to deliver tools to content managers.

Ullman & Rakesh Agrawal: "Data Mining on Steroids"

Ullman – I am the only CS person who says in public favorable things about TIA. The DARPA John Poindexter & AI community project.  On 9/11 you had four guys with visible Al-Qaeda connections who went to 4 different flight schools with no connection to an airline.    If you could integrate all these records, you could have asked the right query.  This happens at two levels. a) How many al-Qaeda guys have been to flight schools? b) Even more ambitious - What strange things are going on?  But how rare was this?

 

This is an interesting problem; use locality-sensitive hashing to focus on connections.   We need to find just the few events that are the most interesting.  The technology is not there yet, but it is an interesting problem.
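One reading of "locality-sensitive hashing to focus on connections" is minhash signatures over each entity's set of associations: entities with similar sets get similar signatures, so candidate links surface without an all-pairs scan. The entities and items below are invented:

```python
import hashlib

def h(seed, item):
    # A family of hash functions indexed by an integer seed.
    return int(hashlib.md5(f"{seed}:{item}".encode()).hexdigest(), 16)

def minhash(items, num_hashes=64):
    # Signature: the minimum hash of the set under each hash function.
    return tuple(min(h(s, x) for x in items) for s in range(num_hashes))

def estimated_similarity(sig_a, sig_b):
    # The fraction of agreeing positions estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Invented association sets: a and b overlap heavily, c is unrelated.
a = minhash({"flight-school-A", "visa-office-B", "bank-X", "phone-123"})
b = minhash({"flight-school-A", "visa-office-B", "bank-X", "phone-999"})
c = minhash({"library", "gym", "bank-Y", "phone-555"})
```

In practice the signatures would be banded into hash buckets so only colliding pairs are compared.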

Gray - The license plate of the guys who were the Washington sniper was looked up 18 times in a few weeks. Nobody noticed this large number of lookups (and all were in the vicinity of one of the shootings) - because of different systems.

Ullman - You need Bayesian theory to tell you how unlikely something is.
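Ullman's point can be illustrated with the sniper-plate story above. The base rate, prior, and likelihood below are invented numbers, chosen only to show the mechanics of Bayes' rule:

```python
from math import exp, factorial

def poisson_tail(k, lam, terms=50):
    """P(X >= k) when X ~ Poisson(lam), summed directly for accuracy."""
    return sum(exp(-lam) * lam ** i / factorial(i) for i in range(k, k + terms))

# Null model (invented): an ordinary plate is queried about once a month,
# so 18 lookups in a few weeks is astronomically unlikely under the null.
p_null = poisson_tail(18, 1.0)

# Bayes' rule: even a tiny prior on "suspicious" dominates so unlikely a null.
prior = 1e-6            # invented prior that a given plate is suspicious
p_if_suspicious = 0.5   # invented likelihood of >= 18 lookups if suspicious
posterior = (prior * p_if_suspicious) / \
            (prior * p_if_suspicious + (1 - prior) * p_null)
```

The posterior comes out essentially 1: the observation is so improbable under the null that even a one-in-a-million prior flips the conclusion.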

Agrawal - Data Mining - Potentials and Challenges

observations

                some transfer of data mining research into products

                most in vertical applications

                horizontal tools - SAS Enterprise Miner, DB2 Intelligent Miner

                data mining in non-conventional domains

                new challenges because of security/privacy concerns

                DARPA initiative to fund data mining research

identifying social links using association rules

                crawled about 1M pages and found Arabic names and charted links to make a social network.  The most popular name was Al Gore - they blew the Arabic name identifier. 
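The co-occurrence core of this idea can be sketched with invented names and pages: pairs whose support (number of pages mentioning both names) clears a threshold become candidate social links.

```python
from collections import Counter
from itertools import combinations

# Invented crawl: each page yields the set of names found on it.
pages = [
    {"alice", "bob"},
    {"alice", "bob", "carol"},
    {"bob", "carol"},
    {"alice", "bob"},
    {"dave"},
]

support = Counter()
for names in pages:
    for pair in combinations(sorted(names), 2):
        support[pair] += 1      # support = number of pages with both names

MIN_SUPPORT = 2
links = {pair for pair, count in support.items() if count >= MIN_SUPPORT}
```

As the Al Gore anecdote shows, the quality of the result hinges on the name extractor, not on the counting.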

Hellerstein - Why not use a graph clustering algorithm?

Agrawal – We are using association rules.

Ullman - You need a strength measure.

Agrawal - website profiling using classification.  training on labels like "Islamic leaders", etc.

Discovering trends using sequential patterns and shape queries - trends in patents, heat removal, emergency coolings, zirconium alloy, feed water. You look for a shape of the graph of % mentioned vs. year of those words. You sketch a "resurgence" in this case - V-shape - drop and then come back.
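The "resurgence" shape query could be sketched like this (the yearly counts are invented): find the minimum of the series and require a sizeable drop before it and a comparable recovery after it.

```python
def is_resurgence(counts, min_drop=0.3):
    """V-shape: a drop of at least min_drop (relative to the earlier peak)
    followed by a comparable recovery."""
    lo_idx = min(range(len(counts)), key=counts.__getitem__)
    if lo_idx == 0 or lo_idx == len(counts) - 1:
        return False                       # no drop or no recovery
    peak_before = max(counts[:lo_idx])
    peak_after = max(counts[lo_idx + 1:])
    lo = counts[lo_idx]
    return ((peak_before - lo) / peak_before >= min_drop and
            (peak_after - lo) / peak_before >= min_drop)

# Invented yearly mention counts for two terms.
zirconium = [40, 35, 20, 8, 12, 30, 38]   # drops, then comes back: V-shape
steady    = [20, 22, 21, 23, 22, 21, 20]  # no resurgence
```

Running the predicate over the per-term mention curves would flag terms like the patent examples above.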

 

They are discovering microcommunities - tightly coupled bipartite graphs – e.g. Japanese elementary schools, Australian fire brigades, - you find tight graphs and then you manually label the areas.

 

new challenges 

                 privacy preserving data mining

                randomizing the data in a way that destroys individual data but not the summarizing stuff

                cryptographic approach

                privacy preserving discovery of association rules

                data mining over compartmentalized databases

                frequent traveler rating model - with demographics, credit ratings, criminal records, etc.  

TIA was going to build a giant warehouse and got flak; perhaps one could use randomized data shipping or local computation.
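The randomization idea above can be sketched as Warner-style randomized response (the response probability and data are invented): each record is perturbed before it leaves the source, so no true individual value is ever shipped, yet the aggregate is recoverable.

```python
import random

P_TRUTH = 0.75          # invented: report the true bit with probability 0.75

def randomize(bit, rng):
    # Local perturbation: the collector never sees the true individual value.
    return bit if rng.random() < P_TRUTH else 1 - bit

def estimate_true_fraction(reports):
    # E[observed] = p*f + (1-p)*(1-f)  =>  f = (observed - (1-p)) / (2p - 1)
    observed = sum(reports) / len(reports)
    return (observed - (1 - P_TRUTH)) / (2 * P_TRUTH - 1)

rng = random.Random(42)
truth = [1] * 3000 + [0] * 7000                  # true fraction is 0.30
reports = [randomize(b, rng) for b in truth]
estimate = estimate_true_fraction(reports)
```

The estimator recovers the population fraction to within sampling noise even though every individual answer is deniable.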

 

Croft - System to return a probability that it can return relevant data and then you go get permission.

Stonebraker - My discomfort is that in theory all warehouses are built for data mining but in fact nobody is doing any of it and the vendors are going broke.  The people I talked to were doing fairly simple things.  No statistical expertise on their staff.

Agrawal - Lots of leading companies are doing this.

Weikum? - The field is approaching saturation.  Interesting research, but it is not for 10 years.  It's incremental.

Silberschatz:  If we solve TIA in 10 years  I would be surprised.

Ullman - Even if you give me everything in the world integrated, I still can't ask the right question.  Even more mundane: what is a gene?

Agrawal

some hard problems

                 past poor predictor of future

                 abrupt changes; wrong training examples

                actionable patterns

                how do we find what is surprising?

                over-fitting vs. not missing the rare nuggets

                how to ensure we are not overfitting - still hard

                richer patterns

                 in medical domain - you need dags

                 simultaneous mining over multiple data types

                text, voice, and structured data

                when to use which algorithms

                avoid "everything looks like a nail to a man with a hammer"

                automatic selection of algorithm parameters

CMU is now offering a degree in data mining (Tom Mitchell running program).

Pazzani - Management schools have been doing some of this for decades

Hellerstein - Many of us don't understand statistics - we should be  educating ourselves.  The undergraduates should be taught a bit more.

Gray - There is a popular book by Jiawei Han that is a nice intro and course.  The challenge is that SAS and other tools are chauffeur driven.  We have to make it easier.  The science community has a size problem.  Business has 1000s or 10ks of records or can subset and use quadratic or cubic algorithms.  Science users have very large datasets (billions). They need log-n or linear heuristics.  GenBank is about 40 GB right now - fairly small.

Hellerstein - We have an area that overlaps with statistical AI.  We need to talk about what we contribute.  people tell us our math skills are not up to the job.

 

Discussion

is data mining "rich" querying?  is it "deeply" integrated with database systems?  most current work makes little use of database functionality

  should analytics be an integral concern of database systems

  issues in datamining over heterogeneous data repositories.

 

Weikum: Should data mining be linked to data quality?  Biomed people very anxious about this. 

Agrawal - yes.

Pazzani:  DB community could teach machine learning about data that doesn't fit in main memory.  You must avoid things that take 10 passes over the data.

Snodgrass - Perhaps we should focus on summarization, visualization, then let people make deductions.

Ullman - I agree, this is one aspect but if all you have is visualization you need help.  Suppose you have 10-D data and you have to know which are the most interesting dimensions. 

Ceri  - What about semi-structured data?

Ullman - I've seen it but it's derivative. 

Abiteboul  - I've also seen it.

 Break

Stonebraker - this is boring, what to do?

The ratio of incrementalism to insight is high.

Gray & Lesk: Suggested tossing the agenda and asking if anyone was passionate about anything other than selling your own research.

Schek - We just have too few breaks; people want fresh air (half the group had left after the break).

Maier - So what?  Should we plan the  wake for DB?

Gray - In previous meetings there has been conflict - relational vs OO; logic programming, XML.

Stonebraker: I'm happy to present a controversial vision statement. What's the purpose of this meeting? In previous cases there were research branches - right now I don't hear the controversy - we are all working away - not at a turning point.

Gray - Why are we here?  It's a 5-year interval - no specific agenda.  It was not that the field is in crisis.  Last time we said text was going to be important but we have not done squat. 

Schek - Other people did the work.

Ullman - I proposed 1 hr ago that the DB community should take charge of TIA.  Use a systems approach.  the spirit of TIA today is an AI spirit.  Describe a wonderful vision with no idea how to do it.  I'd rather work on version 1.0. 

Croft: - Enumerate research issues in TIA   

Ullman - Make clear it is a database rather than an AI issue.

Snodgrass - If you look at the last reports they state 30-40 year goals at a high level, and of course we haven't reached them. 

Agrawal:    We should have some nearer term goals.  

Croft  - So what have we done in the last five years? (Xquery?)

Stonebraker - It looks to us like we're dead on our feet. 

Gray - I'm excited but it's applications, and I'm filling in gaps.

Stonebraker – OS people have quit doing that work - perhaps DB is a mature field and we should also drop things like query optimization. So I propose- we morph after dinner - 3 or 4 people to present visions of some sort that can't be achieved in ten years and listen to that.

Agrawal - One thing that would focus or excite us is some interesting application, and TIA might be that thing.  It has database issues.

Gray - I have political problems with that. TIA has a big-brother overtone.

Stonebraker - This evening anyone can get 15 minutes to say something that can't be accomplished in the next decade.  No restrictions other than that.

Ullman - I understand the political issues about TIA - but it needs to be done.  Just as city dwellers 5,000 years ago needed walls around their cities.  It is a national need.  The government gives guns to 1.5M people and relies on them not to invade your home.  The political problem is to create analysts who get information and don't abuse it.

Stonebraker - This is a subset of heterogeneous federation and data mining.

Lesk: Three challenges e-science, TIA, personal memex [we've now killed 20 mins without getting anywhere]

Stonebraker: Integrating the deep web.

Gray: we have 24 hours left.  Is the field really stagnating?  Should we look for other careers?

Stonebraker - This discussion is very similar to the one 5 years ago.

Abiteboul  - In 1981 people told me databases were dead.

Gray - What has been discussed so far is incremental.  Oracle, IBM working in the mainstream.  What should the researchers be doing?

Abiteboul - But those guys don't publish, so we need to do the same work.

Gray - They write a lot of papers.

Croft - Other areas are defining testbeds - so people could compare techniques. e.g. MT recently - was moribund and then defined a new measure 1.5 years ago and excitement is way up.  (overlap of ngrams).
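The n-gram-overlap measure Croft refers to (the core of the BLEU-style MT metric) can be sketched as modified n-gram precision; the sentences below are made up:

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(candidate, reference, n=2):
    # Clipped overlap: each reference n-gram may be matched at most as many
    # times as it occurs in the reference.
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

reference = "the cat is on the mat".split()
good = "the cat is on a mat".split()
bad = "mat a on cat the is".split()
```

The full metric combines several n-gram orders and a brevity penalty, but even this bigram version separates a near-translation from a word salad.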

Ullman - When you define a measure of progress people make it increase.

Croft - You have to come up with good measures

Maier - Alon was saying for semantic integration what if we found something for people to try - a corpus of 1000 large databases.

Garcia-Molina - Why is it bad to have the same list as 5 years ago?  These are hard problems - should we only work on things we can solve in a year or two?

Bernstein: - It would be a problem if we had only the same solutions and were making no progress.

Gray - What progress have we made in the last seven years.  Lots of things in data mining, cubes, auto-tuning, materialized views. In 1996?1976?  Don Slutz was sending queries to DB2, SQL systems - 90% of the time he got the answer and the rest of the time he got a crash.  Today you can use database systems and that is a result of research.  Research in QA, fixing query optimizers.

Garcia-Molina: Is Google an accomplishment of last five years?

Silberschatz - Do we teach Google in the DB community?  (General yes.)  I have a lot of data on my desktop and I don't use any database tools to manage it.

Bernstein - People use Outlook to manage their contacts (1/3 of the room?)

Hellerstein - The failure with the Gong show is that we talked about other people's work.  (Laura had said this earlier.)

Q: Should we just repeat the last report?  Say it was the right program.

Croft: how do we move ahead? A number of people said this was a really exciting time - so much data around and people care about it.

Lesk: - Get people to do their own queries, just like IR.  That's what made it exciting.

Maier - We have a lot of people who were at Laguna.  Many of us are on our last research project.  I can't do something which is ten years out. Maybe we have the wrong people.

Hellerstein - Disagree completely;  wisdom has value. Phil can e.g. take risks at his stage in the career.

Garcia-Molina - The world is knocking on our door.  There is a threat from terrorists and are we going to say there is nothing to do.

Maier: Who's bored with their current work?  (only Ullman puts his hand up)  Carey and Halevy were the chairs of the two main conferences - what are the big issues?

Halevy - We had a lot of data mining papers and all but one were rejected.

Stonebraker  - I can summarize as "in the past there has been a sea change" and in 1997 it was the web.  Now we're just plodding along.

Gray - Webservices are a sea change.  People can now publish info on the Internet, not just html.

Abiteboul  - Deep web.

Franklin - Instead of a gong show we go around and you get 30 seconds for what excites you.

Stonebraker - we will spend 1 hr after dinner giving 2 mins to each person to say what you're excited about or to present a grand challenge.

Stream processing - where's the beef (or beer)?   Dave Maier and Stan Zdonik

applications

                real-time enterprise

                financial data feeds

                supply chain management

                sensors

                environmental monitoring

                RFID - radio frequency ids - e-zpass type - Gillette just ordered 500M at 10cents each.

   Network monitoring

the sensors are the things that have triggered the big interest

 

what are the issues?

quality of service?

what's wrong with existing technology?

 

issues

                push+latency: the data just comes but it ages fast

                dbms - system controls data flow and optimizes throughput

                sdms - sources control data delivery and you optimize latency

                update followed by query - not fast enough

                overload is possible - rate-based processing

DeWitt -  I see no evidence that optimizing for latency & throughput are different. If you take a standard DBMS and forget about persistence it's the same.

Gray - Standard systems have response time thresholds and try to answer as much as possible.  It is the same thing.

Croft - We also need different architectures to do 100K profiles against news wires.

Gray - In databases you treat queries as records and it works.

Maier - Is there always duality like that - queries and data invert?
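Gray's "treat queries as records" inversion can be sketched concretely: rather than running 100K standing profiles against each incoming news item, index the profiles and probe that index with the item. This is an illustrative sketch only; the profile representation (a conjunction of keywords) is an assumption, not anything presented at the meeting.

```python
# Sketch of the queries-as-data inversion: the standing queries (profiles)
# are indexed like documents, and each arriving document is used as a probe.
from collections import defaultdict

class ProfileIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # term -> ids of profiles using it
        self.required = {}                 # profile id -> number of terms

    def add_profile(self, pid, terms):
        self.required[pid] = len(terms)
        for t in terms:
            self.postings[t].add(pid)

    def match(self, doc_terms):
        # Count how many of each profile's terms the document contains;
        # a profile fires when all of its terms are present.
        hits = defaultdict(int)
        for t in set(doc_terms):
            for pid in self.postings[t]:
                hits[pid] += 1
        return [pid for pid, n in hits.items() if n == self.required[pid]]

idx = ProfileIndex()
idx.add_profile("oil", ["opec", "crude"])
idx.add_profile("tech", ["ibm"])
print(idx.match("opec raises crude output".split()))  # -> ['oil']
```

The cost per document is proportional to the document's terms, not to the number of profiles, which is what makes 100K profiles feasible.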

 

adaptivity

                loads change - so can not do a static plan

                adaptive optimization issues

                scheduling, load shedding, distributed bandwidth-aware optimization

correctness

                semantics may not be deterministic

                approximation, independent streams not synchronized

                transactions do not seem central

                update in place not the norm

                overlap of answer arrival with query processing

                mix queue-based processing with traditional storage
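The "semantics may not be deterministic" point above is easiest to see with windows. A minimal sketch of tumbling-window aggregation, assuming application timestamps and a drainable stream (real stream systems emit windows incrementally and must decide what to do with late or unsynchronized tuples - exactly the unsettled part):

```python
# Tumbling-window average over a timestamped stream; illustrative only.
# Window assignment: each tuple goes to the window starting at
# ts - (ts mod width). Late-arrival policy is deliberately ignored.
from collections import defaultdict

def tumbling_avg(stream, width):
    """stream: iterable of (timestamp, value); yields (window_start, avg)."""
    windows = defaultdict(list)
    for ts, v in stream:
        windows[ts - ts % width].append(v)
    for start in sorted(windows):
        vals = windows[start]
        yield start, sum(vals) / len(vals)

readings = [(1, 10.0), (3, 20.0), (6, 30.0), (7, 50.0)]
print(list(tumbling_avg(readings, width=5)))  # -> [(0, 15.0), (5, 40.0)]
```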

 

Silberschatz - At Lucent we worked on real-time billing - you append the call record in the database - and later you ask about the database.

Hellerstein - The only fun here is when you do distributed - push processing into the routers. 

Franklin - You are missing (a) multiple query optimization problem, and you have to handle queries entering and leaving the system.  (b) we have no agreement on what semantics you want - no agreement on time windows, etc.

Ceri - You have all these queries coming and you want to combine them – need better ways to do that.

Carey - need a benchmark for streams (Widom says it's being done).

Stonebraker - we are writing a linear road benchmark and running on stream systems as well.

 

QoS - quality of service during overload conditions

                overload -> degrade answer

                who needs it?  same for all?

                admission control (turn away) or priority control (delay)

                                on requests; on data

                degrade operators:

                                smaller window size; approximate operators

 

need to know what the use is -- a billing system isn't going to throw away the billing records but a sensing system might well toss inputs.
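One of the degradation options listed above - drop inputs when the feed can't be kept up with - can be sketched in a few lines. The random-drop policy and names are illustrative assumptions; real systems also shed by shrinking windows or substituting approximate operators.

```python
# Load shedding by sampling the input stream under overload.
# A billing feed would use keep_fraction=1.0 (never shed);
# a sensor feed might tolerate 0.5.
import random

def shed(stream, keep_fraction, rng=random.Random(42)):
    """Yield a random subset of the stream, keeping ~keep_fraction of it."""
    for item in stream:
        if rng.random() < keep_fraction:
            yield item

kept = list(shed(range(1000), keep_fraction=0.5))
print(len(kept))  # roughly 500
```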

Gray - We can overprovision.  You need to worry about fault handling not overloading. 

Stonebraker - If a missile is coming in you drop everything else on the floor.

Gray - No you don't; you have dedicated missile tracking hardware.

Stonebraker - That is not what they do.  I have talked to these people.

Zdonik - When something is coming in you get too many pictures.

Maier:

What's wrong with existing technology?

                MQseries + Websphere + DB2

                performance, performance, performance

                scaling: speed, volume, # of requests

                too many boundary crossings (between these three systems)

                linear road benchmark intends to prove this

                second order effects

                 data model is wrong

                 triggers don't scale

                No QoS

[losing track of this]

 

Q - Is the linear road benchmark the OO-1 of streams technology?  (OO-1 showed relational was 1000X slower than object-oriented.)

Maier - Can we build such a benchmark?  Benchmarks can be drivers of innovation like TPC - expanded database capabilities.

Franklin - The reason OO guys liked OO-1 was caching - Relational guys still have problems doing that.  Now we are working on putting big caches in front of relational databases.

Stonebraker - We started storing data then we added procedures, then triggers, then queues. 

Gray And then text. 

Stonebraker - would you do better starting with a clean sheet of paper.

Gray - Postgres and MySQL and the OODB guys started with a clean sheet of paper.  What do you think of the results?  It takes a long time just to get to where the state of the art is.

Hellerstein - There is a problem making the data fit - sensor fusion – e.g. calibrating temperature measurements.

Abiteboul - Applications of streaming - e.g. in security, people watching streams of everything involved in intrusions - processing the fast output of web crawlers: you have to do things on extremely fast streams.  He wants research on what can be done with a given amount of memory, etc.

Ceri – There might be kinds of queries you can do on the fly - they are monotonic and don't aggregate

Agrawal - worried about the security impact.

 

------------------------------------------------------------------------

Stonebraker - dinner at 6pm - some room on the first floor - after dinner we will come back here for gong show 2: a 2-minute passionate discussion from each person of what they think is a great vision or challenge problem.

   mine:

      personal memex

      universal knowledge

      individuals use data themselves

tomorrow 8am start again -

 30 min vis

 60 min pdb

 60 no knobs

 30 min on trust

that's the morning

------------------------------------------------------------------------

Dave DeWitt & Hans Schek - Laguna Beach report from Feb. 1988

  Bernstein, Dayal, DeWitt, Gawlick, Gray, Jarke, Lindsay, Lockermann,

  Maier, Neuhold, Reuter, Rowe, Schek, Schmidt, Schreffl, Stonebraker,

  Ullman (joke)

 

this was a controversial report - there was even a counter-report

Future DB applications it suggested

CASE (software-ugh), CIM (manufacturing), Images(yes), Spatial(yes), IR(yes)

future h/w environment

continue to consume hardware resources as fast as they appeared

 special purpose DB machines were a dead end - (right)

future s/w environment

                DB & OS types would continue to do battle

                "we'll be stuck with current OS interfaces just as our clients are stuck with SQL" - the context was MVS

                4GLs would solve the PL/DBMS impedance mismatch - UGH

                                (protests - some people use things like Tomcat)

                PROLOG+DBMS will yield nothing of lasting value

                                too many groups doing the same thing

                                (objection: bought three nice houses in Madison)

extendible DB managers

                 big debate on OR vs. Toolkit (no conclusion)

                heated debate on OODBMS

                                "misguided" or "highly significant"

                lacking a clear definition of the approach

active databases and rule triggers

                strongly endorsed DBMS support for triggers, alerters, constraints with high performance

                no need for general recursive queries

(analogy drawn between its fate and the fate of Nth normal form) - this was the most controversial statement

end user interface

                                 universal agreement that we needed better user interfaces

                                 lamented lack of research

                                difficult to publish papers

                                reviewers hostile because they lacked graphs and equations

                                need to "demo or die"

                                lack of toolkits (e.g. X11)

single site db technology

                                hardware trends would require rethinking optimization, execution, run time

                 concurrency control dead as a research topic

                support for parallel DBMS research

                stop doing new access methods (except spatial)

distributed DBMS

                                enough research - commercialization about to come

                                only problem was administration of a system with 50,000 nodes (got this wrong)

miscellaneous topics

                                no-knobs physical db design - including index selection and load balancing

                                tables across disk arms

                                better logical design tools

                                 support for continuous streams of data

                                 no more data models please

                                data translation was a solved research problem

                                 better support for information exchange

(most people liked the report –

Ullman objected that if you're going to crap on theory you should invite someone from the theory community)

 

Hans Schek.

  has the original foils of the presentations

  everyone had 4 topics to recommend and 2 topics people should not work on.

  I picked those from the people who are here - Bernstein DeWitt Gawlick Gray Maier Schek Stonebraker

 

Bernstein

pro          distributed sys admin

                                TP application schemas

                                auto data translation

                                active databases

con         database machines

                                real extensible database systems (ORDB) - not research

 

Dave DeWitt

pro          dbms for scientific apps

                                CASE support by DBMS

                                optimization of queries over complex object hierarchies

                                active DBMS - whatever that means

con         general recursive queries

                                hardware sorters, filters

                                concurrency control

                                object oriented dbms that mention encapsulation

 

Dieter  Gawlick

pro          productivity & operations

                                 technology for transaction processing

                                interdisciplinary communication

                                access patterns

 

Gray

pro          procedures in db systems

                                automatic db physical design

                                disaster recovery - data and application replicated

                                10 years continuous operation

                                large or exotic db, 10^12 recs video fax sound

                                specialty databases - case, cim, geo

 

Maier

Pro          single object constraints

                                physical representation language

                                update semantics for logic DBs

                                constructive type theory

con         storing DML as strings in DB - not compositional (anti Postgres)

                                behavior-only object models (Maier says he hasn't a clue what this means)

 

Schek

pro          systematics on semantic data models, knowledge rep, ...

                                optimization mapping to kernel operations

                                host language coupling with external operations

                                tight DBMS cooperation with applications

con more recursive query processing - missing applications

 

Stonebraker

pro          integration of 4GLs prog lang and DBMSs

                                1000 node distributed DBMS

                                abolition of IR systems as one-offs - efficient text in general purpose DBMS

                                end user usable application development environment

con         recursive query processing

                                interface between prolog and DBMS

 (at the time - 1988 - we were just seeing the end of the 5th generation hype)

Abiteboul  - have seen good applications of constraint logic programming - don't throw out everything.

Ullman - logic has had some impact but not in query languages

 

Gray - Let's just modernize this and ship it.

Yannis - What has been accomplished in the 15 years that was a consequence of that report, on the positive side?

Gray - no-knobs stuff, disaster recovery.

Silberschatz: Did we envisage web in 1988?  data mining?

Maier - most prescient is data streams

 

Tuesday

 

Data Visualization – Mike Stonebraker

 

Tioga system has developed into something sold by Rocket Software and shown as an example of user interface - maps with colors reflecting certain properties. Also like PAD++ has zoom-in on data: dots turning into company icons turning into company financials and stock history.  Name is Visionary.  Pan and zoom over geography.  Canvases can be nested and you can put holes in a canvas to see what's behind it.  Demo for dairy farmers showing meters as well as geography for milk quality.  Web-like; you can refresh but it doesn't dynamically follow the data coming in.  (Stonebraker is running on the laptop DB2, Access, SQLServer, and I think one other piece of DB software.)  Uses ODBC connections to database tables - has wizards to help write SQL - can display results in layouts like chart, control, form, hierarchy, map or pattern.  E.g. hierarchy - you pick "superid" and "empid" to link supervisors to their employees - various choices to pick colors and the like - says the dairy farmer app was created in half a day (by an expert). Other than Ben Shneiderman, why don't we do things like this?  (A few people objected they also did UI.)  Larry Rowe asked why GUIs were 2nd class citizens in 1988, and they still are.  Visualization is still done as scientific viz and not database viz.  We're missing an opportunity.

 

Croft - Same situation in IR - sprinkling of visualization papers but it is difficult to do research in this area because hard to evaluate - people show nice examples and the paper got rejected.  Also in IR very high dimensional data - hundreds of thousands of points - you can get galaxies of points but they are useless for finding things - hard to find a powerful visualization.

Gray - from Terraserver - we did prototype and then got in a graphic artist and the difference was stunning. There are three areas -- data storage & retrieval; graphics; and programming; and they're separate.

Stonebraker - the Tioga papers all got turned down. 

DeWitt - Suppose no papers - submitted the interface - at the talk you do a demo.  Would that help?

Serge - There is a man-machine interface community with their own conferences, and they're better qualified than SIGMOD to evaluate this stuff.

Stonebraker - We can keep opting out of this work but we're drawing a smaller circle around our work.

Maier - 1983 SIGMOD in San Jose - Jim Gray gave a talk - said databases had taken COBOL and shrunk it by half - Access statements became SQL queries - but the other half the code is user interface stuff still being done procedurally - so I had a student and we were doing object viewing - Xerox PARC started a similar project - they came out with much glitzier stuff - I stopped work on this - I had a head start but could not keep up with Xerox.

Stonebraker - but we do storage prototypes.

Ceri - my system is a graphical interface for the Web - to publish on the web you need something like this - we enabled people to use an existing tool.  Not our responsibility to do the layouts but to use existing tools. We shouldn't do this but help others.

Maier - another opportunity for visualization - talked to people who do volume rendering - the main memory stuff doesn't work - their papers rediscover how to lay out data on disk to do rendering and zooming well in large data sets.  You have to partner with people who do visualization algorithms.

Stonebraker - ditto for data mining.  We could make a contribution if we chose and we choose not to.

Gray - There is a mindset "it's not in main memory" - 4G of RAM costs $1K and most people's data are smaller than that.  I am working with a student who complained to me that he got 900K records/sec from the DB and he expected 1M.  In main memory things go fast.  If you don't use databases you can make vis very hard - indices help.

Lesk: Why did you take up text but not interface?

Hellerstein - (1) we have colleagues in HCI we should talk to them about how to evaluate this.  the IR folks are better at infecting the HCI conferences than we are.  we have to learn their methodologies. (2) we lost a really important topic which is language spec - that can be done visually better than with languages - people want to point at things and show example calculations - we got papers into VLDB - we had an interesting spreadsheet - the problem is visual languages and for schema heterogeneity this is the place to start - people have to understand it.

Sellis - Suppose you had to pull in data from multiple applications on different machines?  What are the problems?

Stonebraker - This opens an ODBC connection to anything you want anyplace you want.   Does no distributed joins except on the screen.  There are 50,000 line segments in the California coast and rendering those may be slow, but getting the data is not bad.  This system is blindingly fast now.

Ceri - there are web services experts - but we should keep some of this work inside our community because the HCI people don't understand the paradigm of easily specifying joins and data navigation.  This can be part of our field; but web rendering is not something we should touch.

Yannis - there is an advanced visual interfaces conference - not mainstream DB but it does bring the DB and HCI people together.

Stonebraker - I got pissed in 1993 and wanted to start in this area - I may be world's worst graphical artist but we got somewhere just from the idea of wormholes and moving closer to data.

Alon - there is also NLDB -  why not also talk about natural language interfaces?

Stonebraker - we used to do front ends - 4GLs - that was respectable in SIGMOD once upon a time and now it isn't.

Personal Databases  Gerhard Weikum & Jim Gray, Martin Kersten

Gerhard Weikum

Simple case: only look at what I have on my PC

basis: email archives, memos, programs you wrote or ran, photos taken, Web pages ever visited - for last 30 years.

 

scenarios:

   who was the French woman I met at a prog.comm. meeting where Peter Gray was chair?

   when did I think of the idea for this paper and how can I prove that?

   which book did I read while on the flight to my first VLDB 20 years ago?

difficulties/opportunities

substitute data sources for your own memory

query by associative memory - approx time, place, person, institution,  anecdotal events

                transitive closure over combination of similarity predicates

                lazy and sloppy info organization (e.g. folder names)

                automatic annotation; named entity recognition

                 automatic and evolving classification

very long time span with changing terminology & interpretation

                 classification, authority, etc. needs to be time-aware

                 understand that "stream" might be environmental at one time, sensors later

 

 personalized interpretation and bias of terminology evolving over very long time span.  

 continuous learning & re-learning of preferences & biases from user interactions

 [this is heavily about browsing more than detailed searching]

storing all this stuff is a solved issue.

 

Maier - You missed format conversion:  "I went back looking for a 1994 PowerPoint for my algorithms class - I have troff documents linked to particular files."  It's not just saving the software but reliance on libraries and scripts and so on.  How do you capture the knowledge - it is application-owned.  Having it around in the future is a problem.

Weikum: Two standard answers - migration and emulation.

Agrawal - Ray Lorie  has a project to do this - keep things around 100 years.

Lesk - Why don't indexing file systems catch on?

Agrawal - is there a study for Microsoft Exchange? we looked at Lotus Notes - more than 30% just leave their email in their inbox.

Weikum - Users are lazy

Lesk - why shouldn't you leave the searching to the computer?

 

Gray - Personal Memex.

Gordon Bell wrote a paper "Dear Appy" - my programs and data won't talk to each other any more.  She writes back and says you have to put your stuff in a gold standard - which means something that has $1B behind it - ASCII, TIFF, PDF ok, HTML is possible, PostScript not.  And you have the media problem - floppy disk or paper tape.

 

Vannevar Bush - memex paper - imagine trying to fill a terabyte in a year:

  300 KB JPEG            3.6M items/TB   9800 items/day
  1 MB doc               1M items/TB     2900 items/day
  1 hr 256 kb/sec MP3    9.3K items/TB     26 items/day
  1 hr video                                4 items/day
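The table's figures can be checked with back-of-envelope arithmetic; they come out close to the slide's if you take 1 TB = 2^40 bytes and divide by 365 (small rounding differences remain).

```python
# Back-of-envelope check of the "fill a terabyte in a year" table.
TB = 2**40
media = [("300 KB JPEG",            300 * 2**10),
         ("1 MB doc",               2**20),
         ("1 hr MP3 at 256 kb/sec", 256 * 2**10 // 8 * 3600)]
for label, item_bytes in media:
    items_per_tb = TB // item_bytes
    print(f"{label}: {items_per_tb:,} items/TB, "
          f"{items_per_tb // 365} items/day")
```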

 

Gordon Bell MyLifeBits

  20K pages TIFF       1 GB
  2K tracks music      7 GB
  13K photos           2 GB
  10 hrs video         3 GB
  3K docs (ppt, wd)    2 GB
  100K mail messages   3 GB
  total               18 GB

Now recording everything - all conversations, etc. He named everything on the way in - big folders. He couldn't find anything. Gray tried to build a database app but he refused. What has worked is to simply put everything in one database - You search on the text in the annotations.  Google does not organize the internet as a collection of folders.

There is a timeline view, showing thumbnails - also parallel screens of personal and work timelines.

They don't do the face recognition - just do the text inversion and searching. They ocr the papers.  They can do the search and render the result fast – only thousands of things retrieved, usually.
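The "everything in one database, found by text search over annotations" approach can be shown in miniature. SQLite and LIKE-matching stand in for the real MyLifeBits store; the schema and sample rows are assumptions for illustration.

```python
# One table for every kind of item; retrieval is text search over the
# annotation column, not navigation of a folder hierarchy.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, kind TEXT, annotation TEXT)")
db.executemany("INSERT INTO items (kind, annotation) VALUES (?, ?)", [
    ("photo", "Jim and Gordon at the VLDB reception"),
    ("email", "re: Laguna Beach report draft"),
    ("doc",   "Gordon's talk on telepresence"),
])

def search(term):
    cur = db.execute(
        "SELECT kind, annotation FROM items WHERE annotation LIKE ?",
        (f"%{term}%",))
    return cur.fetchall()

print(search("Gordon"))  # two rows: the photo and the doc
```

A production system would use a full-text index rather than LIKE, but the design point is the same: one flat store plus search beats per-application folders.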

AskMSR - automatic question answering.  mines the web

'when was Abraham Lincoln shot'   ok

'what color is grass'             ok

'why is sky blue'                  ok

'what is the meaning of life'      ok

'why did aliens first land on earth'  (they crash landed)

'where is Osama bin Laden' (in the mountains of Afghanistan)

text retrieval is doing a very good job.

images, scientific data - not doing well.  SQL is our strategy for wandering through scientific data.

 

Weikum - Wouldn't XML be even better - e.g. to tell Gordon Bell from a ringing bell.

Gray - The text people are doing this - and the image people are starting to learn how to annotate.

Weikum - So we should team up with people from the other community.

Gray - We can build the plumbing to help them.

Hellerstein - Most people don't have scientific data in their files. This is well in hand - aside from questions like why don't Microsoft apps talk to Apple apps.

Bernstein - System engineering of very large components - consider annotation - We have people whose career is reading large volumes of text and turning it into something formal, one sentence at a time - this is a career - Here we are engineering something which brings together many of those components - The visualization is also part of this - Most of the folks I talk to are enthusiastic about seeing technology applied - The machine learning guys won't stop to work on schema matching or helping with annotation.  We are the systems engineers for anything that involves information management.

Gray - We built a database photostore and gave it to the graphics guys and they did nothing - Once we built this they saw way to use face recognition.

Franklin - This is a great opportunity - you could federate for groups or companies or families - tip of the iceberg.

Stonebraker - if you enlarge the sandbox a little bit - my wife dents her pick on financial records across Schwab, bank, limited partnerships, Quicken, etc.,  - We are not overwhelmed by Gordon's problem but by personal financial management.

Hellerstein - the elderly need this more, including in particular their health records.

 

Kersten - organic databases to support an ambient world

  disappearing DBMS - picture of IBM old tape drives

  now we see that there are small gadgets not big iron

He got a call from Philips - planning more gadgets - dream of an ambient world - The gadgets are all hidden - the TV is in the wall and the controls are in the basement.  But all the remote controls now have knowledge - how do we do the data management?  We need appliances with knowledge.  We now have light fixtures that know date and ambient light.  We have a bathroom mirror that will display stock prices for me and a cartoon for a kid.

 

The database system is gone

  data management can be left to the individual applications

  no need for scaled-down SQL DBMS

  data management doesn't need a DBMS.

 

Phase 1

  DBMS is hidden in the wall.  big server in basement.  no good – Philips does not sell PCs, it sells toothbrushes, TVs, etc.

Phase 2

  every product has its own data.  what sensors does it need?  how communicate?

  how have multi-year backward capability?

 

characteristics needed

  self-descriptiveness: outsider can access, interpret, re-use the schema

  resource requirements are explicitly stated

  software version trail is available to let you time travel

  code-base of the manager is part of the store itself

 

Lesk - does Philips really want to let other people access their schemas?

Martin - yes - Europe is more open.

 

self-organizing

   can split into subsystems with minimal synchronization requirements

   systems can fuse easily - user resolves conflicts

   roll-forward over schema updates, storage optimization

   database can migrate or move to another location.

 

self-repair

   runnable system can be obtained on a new platform with minimal bootstrap

     new toothbrush picks up data from old one or somewhere

   software setup so that a bug can be "fixed" by locking-out part of the code

     without losing all the rest of the functionality

   replicated storage/indexing to recover from failure

   manage a trail of database versions

 

self-aware

   security aware - authenticate environment (toothbrush recognizes your

     fingerprint on the button)

   location aware

   time-aware.  should be able to manually back up in time.

 

grand challenge for the 21st century

   a database management system which can be embedded in a wide collection of hardware and is autonomous, self-descriptive, self-organizing, self-repairing, self-aware, and offers stable data store/recall functionality.

 

Garcia Molina: Sony, Toshiba, etc. can't agree on a DVD-R standard.  Will they agree to this?

Martin: Philips is the only company which has to agree to this.  There may have to be some bargains with Sony but they are optional.  Now thinking about sensors in watches which measure blood pressure and temperature which go to a device that figures out whether you are stressed and changes the music on your radio.

Stonebraker - it is not fair to underestimate heterogeneous human integration problem - my thermostat and home security already are these things - just doing it for Philips will not work.

Martin: but it points where to go.

Hellerstein: This was great - we are missing from Ubicomp and that's a shame.

Bernstein - There are areas that have come and gone - heterogeneous databases looks like a career that will still be here in 20-30 years.

Ullman - some years ago I was on an ISAT study that sounded like this - was a bunch of AI people who wanted distributed active objects – The study went nowhere but they did not see it as a database problem.   

Martin - The Philips problem is not all DB.  But we have a place – supporting the data needed for people.

Franklin - This is happening even more in cars.

Avi - The networking guys want an IP address for every device.

Schek – This is a very important area - looking into applications of hyper-database. in medical environment.  The medical people are already here.  We have to support them.

Weikum - Not such a big problem of data integration - paradigm is e-services sending messages to devices - they have to share some data but not that much.

No Knobs: Dieter Gawlick?

auto tuning

   dynamic allocation of existing resources

     integrate with failure reaction

   low variance in response time

   advisors for need/impact of new resources

 

single/multiple databases

   two comp centers, few blade computers in each

   recent years - now computers cheaper by factor of 10

     so people now minimize the number of comp centers and

     two is the right number.

   most engineering -

      all the vendors working on this - keeping some knobs, use optional

 

Bernstein: - Look at how tree balancing went away when B-trees were invented - the knobs just vanished.

 

auto tuning

  what to do with a bad query/program - reports are not the solution

  80% case -- how do we know that we are at 80% of the benefit for 20% of the effort?  maybe I am only at 0.1% of optimum.

 

tradeoffs

  which application/task/user is most important

  business value of application - can't do this automatically

  rules and regulation

 do we have high level abstractions for these?

 

selecting functionality

  which features?  which preferences?  how much security?

  need to link to the tradeoff policies.

discovery

  low level discovery - broken disks, new disks, blades, etc.

  high level discovery - semantic transformations

 

Stonebraker - Why is this so slow to arrive?  Vendors seem slow.

Gray - Simplicity is a feature, and it fights other features.

Dieter - Of the illities - reliability, security etc. - simplicity is not the top priority.  But now it is getting attention.

Weikum - Workloads are more dynamic now.

Garcia Molina: We need a benchmark - "no knobs 10" means a ten year old that can install the database system.

Gray: Patterson has a group measuring how long it takes to repair a failure injected into a RAID system.  They measure % success and time.

Hellerstein - They did the same thing for databases - those were pretty good at surviving failures.

Bernstein: - one problem making systems simple enough to be auto-tunable is that the systems are so highly tunable - Even the experts don't understand the implications of new features.  It's not just a matter of good tuning algorithms but a way to model the system well.  Engineers must understand the tuning implications of new features they add.  We don't model systems well.

Gawlick - I agree.  We don't put in any feature that breaks reliability or security.  What are the impacts of things like streaming?  We didn't know well enough what this would mean.

Gray - Cost of managing computers vastly exceeds the cost of computers.  In the MSN area the staff is reluctant to manage a database; they understand files, they don't understand databases.  We're struggling with making databases as simple as files - it requires a new model.

Lesk: there are two choices.  My first color screen had 27 adjustment knobs on it - the vendors learned how to make it self-adjust.  My Unix systems used to give me a choice of file block size.  That went away because the performance hit is now acceptable.  You're only looking at the first answer (self-adjust), not the second (remove the choice).

Haas - We are now doing that.  Watch and our knobs will go away.

Bernstein - $4/GB/month to do storage device management - and this is 5 times cheaper than outsiders charge.  Much more than the disk costs.

Gawlick - Recovery, security, and drive down cost.

Stonebraker - Jim Gray had Tandem numbers 10-12 years ago: 80-90% of crashes were operator error.  The goal is to get rid of the operator - you shouldn't leave the knobs in.  Customers would rather have reliability than let computer jockeys turn knobs.

Gawlick - That was a different time.  All of this stuff is gone.  You can no longer delete a table and be unable to get it back.

Gray - There is an operations phase and an administrative phase.  We can automate operations, but not administration - we need wizards and profiles.

Ceri - Goal-oriented tuning?

Gawlick - Yes.  Set an objective - response time not more than X seconds.  Set a policy; something else tunes.
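Gawlick's goal-oriented tuning could be sketched as a feedback loop: the administrator states the objective and the system turns the knob itself.  Everything below (the doubling policy, the toy workload model, the function names) is invented for illustration, not taken from any real system.

```python
# Hypothetical sketch of goal-oriented tuning: state an objective
# ("response time not more than X seconds") and let a feedback loop,
# not an administrator, turn the knob (here, buffer-pool size).

def tune_to_objective(measure_response_time, target_secs,
                      pages=1000, max_pages=100_000):
    """Grow the buffer pool until the observed response time meets the goal."""
    while pages < max_pages:
        if measure_response_time(pages) <= target_secs:
            return pages      # objective met; stop turning the knob
        pages *= 2            # simple doubling policy
    return max_pages          # best effort if the goal is unreachable

# Toy workload model: response time falls as the buffer pool grows.
pages_needed = tune_to_objective(lambda p: 10.0 / (p / 1000), target_secs=2.0)
# pages_needed == 8000 under this toy model
```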

Privacy and Security - Joe Hellerstein (taken from an IBM Almaden workshop organized by Rakesh Agrawal)

 

whose privacy?  whose security?

  individuals, organization, government, or society?

traditionally

  access control

  views (need to know)

  roles, not people

  but now add:

     serious adversaries: MIT students bought used disks and ran shareware

      disk-recovery tools - one disk was from an ATM, another from a pharmacy

     long timescales - in 25 years will your data be private?  will you still

      remember your password?

     scale - lots of people, with rights and access, many info gatherers

      cross-source data integration : 1+1 can be >>2.  even if Census has

         rules and CDC has rules together they may not be strong enough

      people care a lot, more in recent years
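The "1+1 can be >> 2" point can be shown in miniature: two releases that each look safe on their own re-identify a person once joined on quasi-identifiers.  All records below are invented.

```python
# Two "anonymized" releases: the medical table has no names, the voter
# roll has no diagnoses.  Joining on the quasi-identifiers (zip, birthdate)
# links a name to a diagnosis anyway.

medical = [("94301", "1970-02-14", "diabetes")]       # looks safe alone
voters  = [("94301", "1970-02-14", "Alice Example")]  # looks safe alone

linked = [(v[2], m[2])
          for m in medical for v in voters
          if m[:2] == v[:2]]                          # join on zip + birthdate
# linked == [('Alice Example', 'diabetes')]
```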

issues

 managing data use

 trust relationships

 transparency

 incentives

 mechanisms

 goals, metrics

 

primary & secondary use

  Prozac fiasco - you got reminders to tell you to take your drug - then

   the marketing folks sent out a mailing about a new offer - users felt

   their privacy had been breached

  traffic-light cameras installed to catch red-light runners also detecting

   speeders - raised a problem.

  specification of purpose for data - and how to enforce it?

 

trust & relationships

 two sorts of trust

   policy adherence trust - enforced or audited

   relationship trust - may only be loosely related to policies

 changes in relationships are problem - merger/acquisition

 

transparency

 of use

   is the policy crisp and comprehensible

 of disclosure - do you know what you gave up?  is the info on the magstripe

   on your driver's license the same as what is printed on the front?

 of extraction - how do I know what is taken?  e.g. swiping a card may prove you

   are >21 but it can time- and location-stamp you.  how do I know that the

   voting booth did the right thing?

 of data destruction - can you promise this?  some people said it's just too

   hard to ensure bits are gone

 

incentives

 Economic - may make sense, graduated rather than boolean

   shopping cards - claim people don't care.

    privacy is not fungible - my privacy is worth more to me than it is to you

 costs of privacy

   dollar costs - claim black market value of identity is $60/person.

   frictional costs to business

   cost vs. usability : people in human rights work in foreign countries,

     whose life may be in danger, often don't have encryption tools

 

mechanisms

  authorization vs. accountability

   enforcement in computing science sense vs. law enforcement (if somebody

    does this you catch them)

   accountability - catching the bad guys - claimed to scale better

 graceful degradation

   should you avoid a single point of failure that leaks all your data forever

   would you prefer loss to leakage

 human factor - biggest hole is human

   human leaks

   key management

   long timescales

 

goals and metrics

  store my data forever

     not necessarily - as long as I want it and no longer

  enforce my policy forever -

     well if I’ve been in a car accident perhaps my medical records should

     be available

  ease of use - but how to measure it?

  problem statements here are tricky

 

chart: target user (individual -org-gov-soc) vs. approach (legislation,

incentives, enforcement mechanisms).

we haven't done a lot of work about economic incentives.

 

Stonebraker - Identity theft is an unbelievable problem - I was a victim - your life is changed forever - huge hassle.  If there were less privacy, identity theft would be harder - there are social consequences of privacy which are not a good idea.  Right now if every agency that granted credit had my fingerprints, identity theft would be hard, but they don't.  This kind of thing is resisted on privacy grounds, but the commerce implications are serious and fall back on individuals.

Hellerstein - Identifiers are one piece of the problem.

Lesk: contrast between laws in Europe and here?

Hellerstein - It is long and boring.

Weikum - Europe has laws which disallow transitive forwarding - data is stored only for one purpose.  Identity faking is not as bad there.

Bernstein - large institutions face the problem of knowing what they have and what they are allowed to do with it.  Big metadata problem.

Hellerstein - we have this thing called P3P but it's complicated.

Avi - some people say there is no privacy so why worry.

Hellerstein - the book "The Transparent Society" argues that if everyone knew everything we'd be better off, like a small town.

Stonebraker - your phone records are owned by the phone company - no law prevents them from doing whatever they want with them.

Avi - there must be some restriction - I couldn't get access even when I worked for AT&T.  What about E-Z Pass?

Stonebraker - Right.  Utility bills belong to the gas company.  None of this is covered by restrictions.

Pazzani - phone companies cannot use records of who you called to market things to you.

Hellerstein - Rakesh filled three days with this.  Targeted my panel with what are the research topics?

Ceri - how can I change the policy?  or get an exception.

Croft - CSTB has produced a couple of reports about these issues - mechanisms are around.  What should the DB chapter say?  Most of these issues are what CS people talk about in general.

Hellerstein - We have declarative interfaces - queries - I don't know how the problems change when you secure objects, not collections - for example, identifying individuals in aggregation queries.  We own some of the most important systems.

Garcia Molina - do you have a report from Rakesh's workshop?

Agrawal - Google "Almaden institute 2003" - that will find the website; most talks are online.  The first day was policy makers - then people from industry including Kaiser, State Farm, MSN, and then somebody from Newsweek.  Larry Lessig gave an evening talk.  Thus getting requirements.  The 2nd day was technology day - Diffie, Schneier - a current snapshot of technology - what is happening in data systems, lessons to learn; Christos talked.  The 3rd day had 3 workgroups that tried to summarize: (1) requirements, (2) technology, (3) what are the Hilbert problems in this space?  It is also all on video.  Perhaps we can give out DVDs.

 

Gray - The digital rights management issue drives policy on how you can use info; there is a huge economic incentive to control this - it dovetails with this discussion.  They have clear ideas about what kind of policy they want - very fine control.  If they get their way there will be a mechanism to control your information too.

Hellerstein - This is about copying - one dimension - there are other problems.

Rakesh - at WWW conference same discussion - entertainment industry would  prefer not to let people make computers.

Gray - Disappearing-ink company - you can only read your mail at their website.  The genre here is killing off copies - revoke the keys.  You cannot do this transitively.

Hellerstein - record companies don't get this - they focus on initial digital version.

Gray - watermarking technology.

Lesk - Make things easy to use - social engineering

Garcia-Molina - I wish people knew more about me so they didn't send me irrelevant spam.

Hellerstein - Are there mechanisms to put in the report - points in problem space.

Agrawal - This is the IBM-centric viewpoint - absolute privacy has too high a cost - it makes our lives harder.  Some people only use cash - that fraction is coming down.  About 20%-25% of people don't care.  A large fraction are interested - they care a bit but are pragmatic - e.g. personalizing email is fine, but that means I have to give out some information.  IBM wants to design for this pragmatic majority.  The purpose is also not static - this is a declarative specification problem we have to be good at.  It becomes a query modification problem: the policy is stored in the database as metadata, and each query consults this metadata as well as user preferences.  Authorization is done through query rewriting rather than access control.  What to do if the disk can be hijacked?  We can encrypt it, but then range queries run very slowly.  Some negative results - you can't do arithmetic and comparison with safe encryption.  People focus on trying for encryption schemes that allow arithmetic, but they missed allowing aggregation, which can allow indexing over encrypted data.  Notion of retention - do we keep the data forever, or get rid of things when we don't need them?  There are a whole lot of interesting problems - how to do compliance.  How do you give the user faith that you're following the policy?  What if somebody files a lawsuit - how do you prove you followed the policy?  What if the data are exported?  If we can push on the data side we might help people like Larry Lessig coming from the legal side.  (His vision paper was in VLDB last year.)
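A minimal sketch of the query-rewriting idea Agrawal describes, with the policy stored as metadata and consulted on every query.  The table names and predicates are invented, and a real system would rewrite a parsed query, not do string matching.

```python
# Authorization by query rewriting rather than access control: the policy
# lives in the database as metadata, and each incoming query is rewritten
# to respect it.  Policy contents and queries here are illustrative only.

POLICY = {
    # table -> predicate enforcing the stated purpose of the data
    "patients": "consent_purpose = 'treatment'",
    "orders":   "retention_expiry > CURRENT_DATE",
}

def rewrite(sql: str, table: str) -> str:
    """AND the policy predicate into the query (naive string-level rewrite)."""
    pred = POLICY.get(table)
    if pred is None:
        return sql                      # no policy on this table
    if " where " in sql.lower():
        return f"{sql} AND {pred}"
    return f"{sql} WHERE {pred}"

q = rewrite("SELECT name FROM patients", "patients")
# q == "SELECT name FROM patients WHERE consent_purpose = 'treatment'"
```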

Bernstein - see Lessig's books - they show how the design of software defines the possibilities in terms of the legalities that can be enforced or expressed - programmers are making law when they set up a service or a database.

Franklin - this has to be in the report of the meeting.  We have a big responsibility as data managers - this is clearly in our court. 

Garcia Molina -yes.

Ullman - Some queries we can't do in encrypted form, but I think there is more hope than you imagine.  Suppose you appear in different databases as J. Hellerstein or Joe Hellerstein or Joey Hellerstein - in plain text we can write a little routine that folds these - perhaps locality-sensitive hashing will enable some of this, and you can get some leverage in the encrypted space.  Not as secure, but there may be a tradeoff.
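The "little routine that folds these" might look like the toy below, which reduces a name to last-name-plus-first-initial so the three variants collide.  Real entity resolution is far harder; this is only the plain-text baseline Ullman contrasts with the encrypted case.

```python
# Toy name folding: J. Hellerstein, Joe Hellerstein, and Joey Hellerstein
# all map to one key.  Purely illustrative.

def fold(name: str) -> str:
    parts = name.replace(".", "").split()
    return f"{parts[-1].lower()}:{parts[0][0].lower()}"  # last name + first initial

variants = ["J. Hellerstein", "Joe Hellerstein", "Joey Hellerstein"]
keys = {fold(v) for v in variants}
# keys == {"hellerstein:j"}
```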

Hellerstein - Locality-sensitive hashing is dangerous - you can insert things, see where they fall, and zero in.

Agrawal - if you have order-preserving encryption you're leaking info.  We need to specify the security model.
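Agrawal's leak can be demonstrated in a few lines: any order-preserving scheme reveals the rank order of the plaintexts by construction.  The "encryption" below is just a strictly increasing function standing in for a real scheme.

```python
# Order-preserving "encryption" (a strictly increasing toy function):
# sorting the ciphertexts reveals the exact rank order of the plaintexts,
# no key needed.  That is the leak.

def ope(x: int) -> int:
    return 7 * x + 3          # stand-in: monotone, hence order-preserving

salaries = [50_000, 90_000, 70_000]
ciphertexts = [ope(s) for s in salaries]

rank_from_cipher = sorted(range(3), key=lambda i: ciphertexts[i])
rank_from_plain  = sorted(range(3), key=lambda i: salaries[i])
# rank_from_cipher == rank_from_plain == [0, 2, 1]
```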

Gray - Use Palladium - data is in the clear in the CPU but not in memory or on disk.  The CPU sees the world as if it were clear, but the cache lines are encrypted.

Agrawal - IBM mainframes support encryption and the DB people have not been able to figure out how to use it.

Franklin - well, CPUs are 90% unused, so perhaps we can afford to buy more disk arms, do scans, and trade CPU power for security.

The Report: Stonebraker

 

Stonebraker - we could sketch the report - use lunch to refine this.  Regroup after lunch. 

 

Context:

  we have to write one

  we can't say "we got it right in 1997" (or 98)

 

proposal

  make up a 50 year challenge with a 5-10 year milestone

  "highest pole in the tent"

 

tone

  information management focus

  networks and cpus are unimportant tools

 

reasonable tents

  deep web

  personal dbms

  query the internet

 

example

  "integrate the deep web"

       structured data on the web

       your home security system and your toothbrush are on the web

     new data model and query language for text, streams, structured data,

       and triggers. 

     heterogeneous federated DB

       a million sites

       semi-automatic wrappers

       "finding" problem - who has this?

       dynamic federation - Martin talked about sites coming and going

       schema choosing - Alon wants to add things like picking my friend's schema

     self repair, self-adapt, new GUIs, privacy

     superset of TIA problem

      "tell me something interesting"

 

  short term goals

     get a benchmark/test harness

     have a serious pow-wow with IR

     play out the separate stream system debate

     design next version of Xstuff

     get somewhere on "tell me something interesting" - challenge data mining

     get somewhere on wrapper technology - how build all these automatically

 

Hans Schek - you left out multimedia. 

Stonebraker - have everyone spend lunch improving this.

Jennifer - can you switch the top level label - it doesn't have much to do with deep web.  I like the bullets.

Bernstein - Get rid of the top grand-challenge bullet - there are grand challenges, and they have a common info mgt problem.  Our problem is the common part (75%) of everyone else's grand challenge.  Everyone likes this.

Gray  - We should reduce this to 25% - people won't swallow 75%.

Bernstein - We're going to have a lot of grand challenges over the next 5 years in CS - we shouldn't be one in the pile but common to all.

Stonebraker - Phil & I will write report.

Pazzani - federated databases sounds like a 15-year-old term - if you mean semantic heterogeneity it will sound better.  Distributed databases sounds like bigger iron - this is something different.

Stonebraker - Let's spend 40 minutes on brainstorming.  I will write a complete draft and then Phil will fix it.  Then we'll circulate it and you can whine about it.

Bernstein - a) We need to explain the DB field - we have core competencies in access methods and query optimization, transactions, schema management.  We are driven by bigger disks and faster networks.  Related technologies - machine learning, graphics, etc. - can make major contributions; we are the integrators and learn how to apply them to info management.  Iannis was talking about raising the level of abstraction by putting those three components together (?what 3).  b) This is the fifth such report; we need to look back at earlier reports - not just what changed, but which problems are long-term and which have been solved or are under control.

DeWitt - do you want to review the past recommendations and say how this fits or what?

Bernstein - mostly thinking about the way some problems recur and some have dropped off the road map.  We are uniquely positioned - many people in grand challenge mode think we're in good shape and can look back and talk about what we've learned from 20 years of this.

Marten - We should start with 98 report.  Repeat at least one recommendation.

Bernstein - Previous report was information utility - that's almost a satisfactory label.

Maier - Federation then might have been n=3 and is now n=10k.

Schek - Huge change in technology - ubiquitous computing - where is the data now - we need a new grand challenge.

Marten - The hardware is disappearing.

Hellerstein - Perhaps do the history separately - might distract too much and take too much space.

Gray - Phil only wanted two paragraphs.  The really new thing is concern about privacy.

Marten - It was a hot topic in the early 80s.

Gray - it did not make it into our reports.

Bernstein - in 1980s we knocked off good problems one after another - replication, optimization; heterogeneous data was on the list then too.  Basic transaction management has gone.

Croft - Talk about how the process can be facilitated.

Stonebraker - we need some short-term milestones.  The annual text exercise has been good for IR.

Marten - TREC-like thing e.g. 2004 "year of privacy in databases" and have a conference.

Bernstein - Rick proposed a journal of data sets.  produce challenge examples.

Maier:  the thousand database corpus.

Gray - Don't do 1000 - do 10.  It is hard to curate them, and 1000 will diffuse focus; 10 will be better.

Ullman - we wanted to take all the CS dept databases - but there was a big privacy problem.  I wonder if we could get that.

Stonebraker - every CS dept has public databases - e.g. spring schedule of classes. 

Ullman - what is easy to get is all web data.  if that's the entire benchmark it might not be enough.  You really want some spreadsheets and files.

Gray - we want a dozen or so.  don't design it here.

Croft - TREC comes from a group getting together - the top web page says "how to suggest a new track". It helps that you have funders.  NIST administers it. There are 8 or 9 tracks now.  (Some terminate - there may be a dozen total).

Pazzani - Irvine collected machine-learning databases - one guy just got in people's faces whenever they said they had one.

Lesk - Something related to 9/11

Jeff - Query optimization for distributed heterogeneous queries.

Gray - What is the milestone?

Jeff - Hard to phrase as percentages of improvement.

Stonebraker - Test harness - we want a single performance metric.  Jim's benchmarks all come with a single number; the Sequoia benchmark has 20 queries and progress on them is measurable.  Astronomy is like that.

Jeff - another 5 yr milestone is a new language which integrates structured, semi-structured, and text data.

Marten - Another is an open source database kernel.

Widom - What's wrong with MySQL?

Franklin - A lot of people see open source databases - If we are not involved it hurts our community.

Stonebraker - IBM owns the Illustra/Postgres code line - and has no use for it - can you guys get PR by making it open source?

Laura - I asked that and got an answer which makes me think we don't own it.

Gray - when you publish your data you will publish in some database - we need an open source way to do that.

Maier - Why do I need the database?

Gray - What, just comma-separated values?  I publish Sloan stuff on a disk with procedures.  I propose that if we publish databases and the methods that encapsulate the objects, you need an open source database to do that.

Hellerstein - What about just XML dumps of the data?

Gray - How do I get the procedures? There is no standard source procedure language.

Stonebraker - the wrong way to do this testbed is to just publish some data sets - each of us has to donate a machine, or a piece of a machine, that is accessible and running something we can get at.  So we need 15 volunteers.

Hellerstein - Intel will volunteer to help - the PlanetLab community.

Gray - Grid guys will too.

Serge - presenting a data model and a query language for semi-structured data as a challenge doesn't seem like this century - it's last century.  We need sensors, updates, etc.

Pazzani - try to understand what's changed.  Semantic heterogeneity has been around for a while but some people say they are making more progress.  There is a lot more scientific data around.

Zdonik?  5 year goal involving outreach to other communities.

Croft - We ship IR testbeds or toolkits - we never use a database system.

Bernstein - At least 3 versions of outreach: DB technology for others, more collaboration (databases plus other technologies), or actual applications.  The scientific database area is an expanding trend.  If this report is to be read by outsiders, statements that we are moving this way would help.

Franklin - in five years we'll say we missed bioinformatics.

Stonebraker - politically and technically, researchers should find app areas and help solve them.  We pay lip service to that; Jim has been doing it.

Bernstein - Lots of people are doing it.  Dave Maier and I were at NLM, and a lot of people at Michigan and Penn are doing this - it is not just biologists learning databases.

Alon - How to make us upward compatible with all the grand challenges - how to make this concrete?  We provide some services to all of these challenges - streaming data, federating data, etc.  Define them as services you can expect from our community.

Lesk - The "find Tony Blair" example - get his face & voice off the web.  Show we noticed 9/11.

Ceri - pose a query - find the sources - they may change depending on context, and then all the issues of caching come into play.  Exporting our particular expertise in optimization to the global infrastructure, or Hans' hyperdatabase, is measurable and will happen in 5 years.

Gray - Archiving - come up with representations that are likely to last for a century. Privacy - follow up on the template Hellerstein laid out – policies and enforcement - tools for enforcers and policymakers.

Maier - Asilomar 2103 should be able to read the slides from this meeting.

Rakesh - One major problem: given a database, find what is personally identifiable.  Which fields matter?

DeWitt - Either as a milestone or a challenge we should make progress on getting scientists to stop using filesystems (too negative) - say, on getting them to use databases.

Marten - We need younger people at this meeting - sliding window.

Yannis - Antiprivacy direction - in 5 years we should be able to have personalized behavior from databases - you access a public database and it behaves differently.

Croft - Database courses are morphing into things from other disciplines - should we endorse this?

Jeff - I don't believe it - it's not happening except at Berkeley.  It was a long trek getting to where a database course was a serious piece of a CS program, and I don't want to give that up.

Franklin - We should teach additional courses that show how to use things from other disciplines.

Gawlick - secure database - a 5-year challenge.

Avi - one issue we should worry about: when we combine databases, how do we know the result is meaningful?  We have no idea about the places we are getting data from - we need to do some math about the reliability of results.

Bernstein - if we believe that bringing in other information-management fields is important, we should endorse support for a broader range of topics in DB research conferences.  Do people like what's happening at VLDB?

Hellerstein - VLDB has been saying really weird things, but a lot of papers at SIGMOD and VLDB have little bits of statistics or fractals.  People are sensitized to this.  The VLDB extra tracks are e-commerce apps, and that would not be in a curriculum.

DeWitt - This is the year of XML.  (VLDB this year has more than 10 XML papers)

Agrawal  - In five years it will be a good test to say "tell me something interesting from your data". 

Stonebraker - The meeting is winding down.  Send your slides to Jim.  I will send notes around.

Jim will have a website for the workshop.