Asilomar in Lowell - May 5 2003

Raw notes by Mike Lesk

 

Gong show Sunday Night - Hans Schek, Moderator

Hans Schek - new infrastructure for information space - this is thousands or millions of databases and computational services that do multimedia and text classification, term extraction, an architecture with lots of services and I call the new architecture "hyperdatabase" - databases can sit on all our devices.  Analogy - a database is a platform for developing applications on shared data - hyperdatabase is a platform for developing services; instead of indexing we need feature extraction.  Research area not just for us.

 

Phil Bernstein – Drivers of research topics: exploit information systems technology, new ideas for classical DB problems, grand challenges, combine DB & non-DB technology, application. Build knowledge-based systems that work: more logical inferencing, richer knowledge representation, engineer combination of AI & DB.  Build experiment management systems: data management for experimental investigation, for science and engineering.  Requires rapid schema evolution; versioning data, schemas and types; heavy use of rich data types and data mining, integration with portal management.

 

Mike Franklin - working on suite of projects about query processing in strange and interesting environments - Telegraph with Joe - stream processing how do this with lots of sharing, adaptive processing, and sensor networks - how push this out into the network - lots more interesting work to be done.  Errors, lost messages, nature of sensing the environment.  XML broker - how process large numbers of Xpath and Xquery queries - 10s of thousands or 100s of thousands.  Also applying query processing techniques to the Grid.  Make it more interactive and more easily programmable.

 

Bruce Croft -  talk from recent ICDE conference - developing new probabilistic model of retrieval - applied to cross-language retrieval, image retrieval, and tighter integration with speech recognition, and MT.  Just started working on pushing that to do retrieval in semi-structured IR domain.  Can we provide IR API for semistructured database?

 

Joe Hellerstein - bringing data independence to networking - network is not just moving packets like post office- trying to write intelligent programs on top of a very volatile system - programs must be robust against nodes coming and going - we're doing sensor networks and peer to peer processing - convergence between graph algorithms and query optimization - we need adaptive algorithms - I enjoy both algorithms and building systems - having more fun than in several years.

 

Jeff Ullman - group should address TIA problem - linked discovery or chains of discovery in multiple databases - the technical problem excites me. Query optimization is not the same for streams as for traditional databases - the new stuff with XML is not just the same as SQL optimization.

 

Rick Snodgrass - methodological basis for our field - now hampered by our methodology - in science knowledge is encoded in theories - scientific theories are testable and make predictions - our basis is twofold (a) how about it, (b) if you need better performance we'll put something on.  We test on a few data points.  We don't have scientific models.  We need a list of needed scientific models.  I have 4 suggestions which I sent out in email - can't do this in a few seconds.  Each model is predictive and testable.

 

Avi Silberschatz - many years ago I had a dream - my laptop was a database machine but universal access to all data in the same way - some people in Stony Brook had the same goal - the database would sit below the operating system.  Don’t want to have to remember things like "lpq".  All the data in the world will sit in some form of database with universal access to it. Lots of research issues- would be great to accomplish this.

 

Mike Carey - stopped doing research a few years ago - I’m in industry waiting for problems to come to me - working on XML - adding workflow to XQuery so it can do data transformations - now using XML schemas to think about your data and XQuery to express integration; want to integrate services and data.

 

Alon Halevy - my goal is to get people to stop complaints about semantic heterogeneity - want to automatically match between objects in different databases - experts use names and values - but over time they see lots of schemas and they get good at this. We’re using a big corpus to learn things-e.g. typical attributes for a field named "student" and using patterns of this sort to match between different schemas and reformulate queries for a database we don't know anything about.  This is part of our idea of how to do Google of 10K databases; we reformulate your query.  More generally cross the "structure chasm" between IR and DB world - make it easier for people to author and query data.
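
[Editor's sketch of the corpus idea, not Halevy's actual system. The corpus contents, the concept names, and the use of Jaccard overlap as the matching score are all assumptions made for illustration: an unknown table is matched to the concept whose typical attributes it shares most.]

```python
# Hypothetical corpus statistics: for each concept, the attribute names
# seen in previously collected schemas. All entries are invented.
CORPUS = {
    "student": {"name", "id", "major", "gpa", "advisor"},
    "course": {"title", "number", "credits", "instructor"},
    "employee": {"name", "id", "salary", "department", "manager"},
}


def jaccard(a, b):
    """Overlap of two attribute sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_concept(attributes, corpus=CORPUS):
    """Return (concept, score) for the corpus concept closest to the attributes."""
    attrs = {a.lower() for a in attributes}
    scored = sorted(
        ((jaccard(attrs, known), concept) for concept, known in corpus.items()),
        reverse=True,
    )
    score, concept = scored[0]
    return concept, score


# A table with an unfamiliar name, e.g. pupil(Name, ID, GPA, Major),
# still lands on "student" because of its attribute pattern.
print(best_concept(["Name", "ID", "GPA", "Major"]))
```

A real matcher would also use data values and learned patterns, as the notes say experts do; attribute-name overlap is only the simplest signal.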

 

Rakesh Agrawal - how can we make our data systems more privacy aware - can we design info systems which will be sensitive to privacy and data ownership but not impede flow of information.  Two primary drivers - technology is too invasive in wrong hands - build in antidotes; and new business models that require cooperation plus national security requiring need-to-know for sharing; and the underlying technologies for these are likely to be similar.

 

Mike Pazzani - I'm here to listen - want to hear things that will increase the NSF budget in this area - so one mission is to ensure progress of science and self-generated issues from DB community might lead to 5% increases; this year it might be 15% - to the extent you find interesting publishable problems which help people in the geosciences or biosciences there may be double digit increases - from $9M this year - take things like making it easier to get data into a database if it is not text - if it is a time series or images or chemical models - people have no idea how to do this and they're using undergraduates; toolkits to make this easy might lead to increased funding. Second, semantic heterogeneity is important - the Atkins report says some scientists spend 75% of their time moving data from one format to another and if you helped with that the scientists would love you.  I coordinate our relations with homeland security but TIA is not that different from finding all data about SARS or West Nile.

 

Jim Gray - we do information management; databases are a core part.  We don't do face rec or voice recognition - we don't know data type details - we are good at indexing and storing things and getting them back.  So our top level comment is we do information management.  There are 3,000 people at Redmond trying to integrate things on the desktop - someday if you buy some Microsoft software, Avi, you may get something like what you want.  3 people in my group are doing a poor man's version of this.  Copying everything you do into a database.  No file folder hierarchy - you can pivot on anything - big fans of annotation but computers do the annotation - automatic face recognition and OCR, for example.  Our task is to store the annotations.  The my life project is incredibly good for press attention - Gordon Bell has put his life into this and as he gets older it's a big help for your memory.  Work on the world-wide telescope - put all astronomy data online - let people ask questions and this is the first distributed database I've really found - web databases are great - they ask a query and back come a heterogeneous collection of records.

 

Jennifer Widom - an extremely specific problem which will force us to think about some interesting things - keyword search over XML and have it give the right answer. (she was willing to stop there)  IR-DB thing.  a challenge to deal with semistructured data - deal with metadata - it all comes out if you do IR-like search over XML which has data/metadata mixed together and we don't know what the data really mean - if you go to XML and give keywords this also has to do with ranking, probabilistic reasoning, and it all comes together. (Schek: some people in Europe do this)

 

Stefano Ceri - most of the web systems publish some information which comes from databases and publish some dynamic data.  I work on principles which let you build some of these systems - I envisage a future where we could teach how to build web applications.  That is up to our community.

 

Stan Zdonik - stream processing systems - addition of quality of service makes them different and unique.  by having quality of service it becomes more difficult.  Mike Franklin and I worked for a long time - used profiles - tried to understand user needs and application needs - if we want to move to autonomic systems we need to understand how workloads categorize different kinds of information and the QoS part of stream processing helps there.

 

Dave Maier - stream stuff is 1/3 of my time, 1/3 is putting your own structure on information, e.g. re-using attention, applications to personal data management, 1/3 is looking at data product management - from scientific data - observed and simulated data, and the fourth 1/3 is distributed query processing, data theory hybrids, trade latency for need to maintain distributed state, have to have some way to talk about catalog and coverage information, not just what kinds of data sources have but what coverage.

 

Dieter Gawlick - stream processing, started with conventional data bases, 90% of our work is done.  we have sophisticated use of information distribution so people can get a response when they are not around.  we did some interesting stuff expressionless data? - Demand analysis - you have something, who wants to have it.  Oracle streams.  all in products.  as we go forward we have some ideas.  I'm looking for business.  as I do updates in a DB I get a stream.  we don't start from events; we think of everything as a history; the sweet spot for a query is when something comes into the history - just what is on the brink of history.  doing this demand analysis leads to a different model.  I subscribe to a publication and it appears.

 

Gerhard Weikum - working on French woman query and others from this morning; working with DB technology plus machine learning, and ontologies are also an asset we would like to exploit.

 

Martin Kersten - battle between multimedia DB and database kernel guys; the media guys have thousands of hours of media and they run their jobs for hours - they need array based functionality.  the database kernel people are battling the hardware - we can't get to the data fast enough the CPU is idle 90% of the time so what can we throw out - we toss out hashing, random access, go to streamed processing - a new generation of DB kernels will be 10X faster.

 

Abiteboul - web is a large knowledge base - worked on that - but the web is not yet XML - We should contribute to turning the Web into a large knowledge base. Precise questions get precise answers and not list of documents -- second problem is mixing XML docs with web services - exchange information that is static or dynamic web services – super-excited bringing XML with embedded service calls, active data bases, query processing, tree structured data, and something very important is recursive query processing.  to give you a flavor when you use Kazaa to find records you are doing recursive query processing on the web - not efficient - beautiful technology, now we have a use for it.

 

Timos Sellis - formalize ETL by picking the right operators - if you have documents in thematic hierarchies - where does this doc sit where you find it - ranking should take account of this - ontology information, semantics, technical part of the work is managing catalogs - building new thematic directories - extend search to use structural properties on top of keywords

 

Dave DeWitt - I used to do optimization for XQuery but I think that's impossible - at the end of my query I'm interested in a few hard core issues - what do we do with terabyte disk drives which make parallelism hard and bandwidth per drive is poor.  we have to treat them as tape drives.  optimization: queries are too complex, statistics are not good enough - look at what is in the buffer pool, what pages or tuples, push adaptive optimization.  I’m hard core DB; the web stuff is interesting but not for me.  me - the Vannevar Bush dream and Jim Gray - everything online - we have the text - but we need help on the numbers - ordinary people should be able to add info and get it back

 

Hector Garcia-Molina - you have lots of systems interacting but they are autonomous - why will they cooperate - what is the incentive to share or cooperate, even forward messages let alone queries or results.  have to think about how systems get incentivized and why we should trust them. lots of interesting problems there.

 

Laura Haas - I don't really do research any more - very interested in large scale systems - integrate information from diverse systems - range from federated to big warehouses using ETL - space with caching and replication in the middle - interesting physical DB design issue - also crosses organizational boundaries - data placement problems - also need systems to be more dynamic - now they are statically configured and an expert sets them up - want to automatically map to different data sources - the whole grid thing - service interfaces - potential for us - how we can use the kinds of services that are provided by the grid-op-sys people - as they provide security services or accounting/billing our systems should open up and do things like that.

 

Mike Stonebraker - my grand vision in 1974 was System R.  over the last 25 years we have added (a) code in databases - Postgres added code (b) spatial data, arrays and new data types - another add-on (c) active data bases - glued on triggers, they are second class citizens, (d) text - in object relational systems we put on text but there is no ranking or probabilistic stuff (e) queue management added but not enough for streams (f) parallel database (g) distributed databases especially web ones and heterogeneous and everyone is putting an O?DP wrapper around what's there.  we took a 1974 idea and glued on all this crap - we should start with a clean sheet of paper; then we would not get architectures that said DB2 has the data and BEA WebLogic has the code.  that's a dumb distribution in a lot of ways.  we have a gazillion of these things - they don't work together - rethink from a blank sheet - new query languages, interface, architecture.  sad Ted Codd just died.  new mandate needed to do this better.

 

Naughton - meta-level comment - if you do what is helpful for scientists it is of enormous benefit for non-scientists - look at the web or computers.  it will happen again.  every interesting technical challenge exists in the scientific arena - personal db, data streams, web data bases, all in more manageable ways.  and we can experiment on scientists more easily than on corporations.  and funding agencies will support this.  if we keep looking for motivation in problems that have already perceived commercial value industry will be doing it already.

 

Yannis Ioannidis - 2 things (a) if you want to buy a used car you can go on the web and bid; I see a need to buy data products - the value of a product may be a dollar price or in conjunction with how fast you get it or how complete the data is or how reliable - I want e-commerce techniques for query optimization - it's not just buying one thing; pieces of a query may run in different places; Mariposa+++ - multidimensional optimization using e-commerce techniques, lots of theoretical issues also. (b) personalization in databases - I should not get the same answer to a query as Ullman gets - much personalization in the Web but not in DB.

------------------------------------------------------------------------

[I think I heard a lot about streaming, about scientific data, about shared out architectures, and scattered other topics.  few people are doing what they did for their PhD]

 

DeWitt - nobody said they were working on XQuery - (Franklin, Carey, are doing some).  do you guys believe lots of stuff is going to be done?

Stonebraker - nobody mentioned concurrency control or access methods.  we are all out on the periphery. Bernstein:  there are people doing this. work in multiversion access methods.

Schek - there is some work on transactions in a more general sense.

Bernstein - dozen PhDs working on XML query optimization just not in research

Widom - people are doing optimization under the label "Xpath queries under streaming XML". 

Franklin demurs.

Hellerstein - few people needed storage.

Kickoff Monday Morning

Stonebraker -

a)       Vote on the organizer's view of what the important problems are - you each get 3 votes on the topics. From yesterday's.

b)       Tonight's session will be different - 1 hr on personal DB led by Gray & Weikum; 1 hr on "vision" - why did the last 4 reports have little influence - so we need this as a vision thing - just a brainstorming session.  Want to end early today.  Everybody is encouraged to stop when the discussion seems to have petered out, not just use all the time.  The chairs will also do that.

Avi - who is audience for the report?  Stonebraker - the audience is researchers picking research directions and also funding agencies.  If we don't write something $1B goes to supercomputer centers and not us.

Hans - what did we present 15 years ago?  Is that still in the vision?

DeWitt will do the diff.

Bruce Croft - IR & structured data.

What is IR?  

  70s-80s research focused on document retrieval

  90s TREC reinforced the IR==document retrieval view

first bib records, then full text as time went on.

now doc retrieval is important - turned into web search; other topics:

                                question answering - finding short segments with particular info

                                cross lingual retrieval

                                distributed retrieval - now big

                                 topic detection and tracking

                                multimedia retrieval - images, video, annotating them - starting up now,

                                learning and labeling images and video with text.

                                summarization

IR & databases

                differentiated by unstructured/structured data

                what about marked up text and semi-structured data?

                 text has tended always to have at least a few fields

                recent database papers on nearest neighbor and similarity search

                distributed peer to peer search

                Web search

                info extraction

                text data mining

boundaries getting fuzzier

IR integrated with databases

                many such proposals - now in XML context -go back to 70s

                                e.g. combine ranked search and the specificity of user queries

                supporting a probabilistic framework is the key

integration vs. cooperation: do we really want one giant system?  or should

                we still have separate systems & separate capabilities but they work together

semantic web - "if you made the web a database" - this means making the web into a knowledge base and that won't happen - we've had a debate for decades about manual vs. automatic representations of what documents "mean" and both together work better than either one but creating the manual versions is very hard.  That's the lesson from the IR work.

                go for knowledge or statistics?

Stonebraker - every Wall St house wants to merge the news feed and the stock prices - why is it so hard to identify companies whose price has changed and which are mentioned in the news ticker?
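
[Editor's sketch of the news/price join Stonebraker describes. Every ticker, price, alias, and headline below is invented. The alias table stands in for the hard part, which is recognizing company mentions in free text; once that mapping exists, the join itself is short.]

```python
# Hypothetical toy data: ticker -> (previous close, latest price).
PRICES = {
    "ACME": (10.00, 11.50),
    "GLOBX": (20.00, 20.10),
    "INITECH": (5.00, 4.00),
}

# Hypothetical alias table: names as they appear in news text -> ticker.
ALIASES = {
    "acme corp": "ACME",
    "acme": "ACME",
    "globex": "GLOBX",
    "initech": "INITECH",
}

NEWS = [
    "Acme Corp announces record quarter",
    "Regulators open inquiry into Initech accounting",
    "Weather delays shipping in the Northeast",
]


def mentioned_tickers(headline):
    """Return the set of tickers whose alias occurs in the headline."""
    text = headline.lower()
    return {ticker for alias, ticker in ALIASES.items() if alias in text}


def movers_in_news(news, prices, threshold=0.05):
    """Join headlines with prices; keep tickers that moved more than threshold."""
    hits = []
    for headline in news:
        for ticker in sorted(mentioned_tickers(headline)):
            old, new = prices[ticker]
            change = (new - old) / old
            if abs(change) > threshold:
                hits.append((ticker, round(change, 3), headline))
    return hits


for hit in movers_in_news(NEWS, PRICES):
    print(hit)
```

Plain substring matching is deliberately naive here: it misses abbreviations and misfires on ambiguous names, which is where the IR side of the problem comes in.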

Gray - why haven't you mentioned KDD (Knowledge discovery)?  The field is very fragmented.  Every product has a text retrieval bolt-on to their database.

Croft - anyone who talks text data mining is similar to IR - that works well together.  The data mining in structured data - numbers -

Hellerstein - it's all based on clustering, etc.  Same as machine learning - there is a common set of technologies.

Croft - IR people like NL - want to understand how to describe and satisfy an information need in an unstructured world.  That gets us excited.  Yes, we built inverted file technology for large data but we focused on NL and the DB people have different needs.

Stonebraker - If I ask Google "what is the temperature in Lowell" I get a terrible answer. Why can't it invoke weather.com and fill in the Lowell zip code?

Croft  - We are working on that in the question answering world.  You do want some context - you want more than just "73" as an answer (did it come from Bob's home page or where?)  DB retrieval is fact retrieval so there is overlap.  Some people work on extracting tables from text.

Stonebraker - this is similar to the first time I heard the discussion 20 years ago.  The communities should cooperate and they don't.

Hellerstein - Not true!  There has been a lot of overlap, now forced by the Web - the database community feels weak on text - and then we found that the IR stuff isn't that hard.  Cohera and Whizbang are companies that had combined products.  This is a healthy area.

Mike Franklin - How many people have been to SIGIR conferences?  (few)

Mike Carey? We are organized into stacks. We should have a conference on a problem - not by community.

Phil Bernstein - This is always true.  There are always problems in slightly different areas and they always give an opportunity for taking techniques across.  Similarly DL is slightly different from IR.

DeWitt- We could organize a conference on a topic.  I like that idea.

(Martin Kersten?) - We don't need anything new - just join to attack a problem.

Croft - We do need something different than "you come to our conferences and vice versa". 

Timos Sellis - A few more applications?

Croft - Want to have an NL query and not think about it. 

Ullman - Re semantic web - you talk about semantics but when you have to do something you do syntax.  If you take the temperature in Lowell thing you ought to be able just to say "temperature Lowell" - How much more is there to do? Crawlers are bad at this because it is timely.  History in Lowell would work better on Google.  I’m curious as to what you think is the advantage of focusing on deep understanding rather than giving people tools to use?

Croft - This history so far is that focus on deep understanding and semantics has not produced benefits in effectiveness.  Learning patterns has been useful - applied probabilistically - the little words don't help.

Gray - People use mostly nouns and verbs - they can throw away the rest – a telegraphic interface.

Ullman - Temperature is special because you can't crawl the web and get temperature in Lowell today.  Google will decide someday to carefully look for weather the way it can look for maps.

Gerhard Weikum - NL is something that is not that great for queries - you need to understand the text that is there.

Ullman - Google works because it is simple.

Pazzani - The Google answer to "what is the temperature in Lowell" is a web site about Uranus (well actually he says he misspelled temperature – the actual result of the search is indeed the weather underground site)

Serge Abiteboul - The problem is that if you have some info you can put it into plaintext and that's ridiculous.  You have meta information and the question is when you have information if you start publishing meta-info it makes it much easier to avoid NL understanding.

Hellerstein - You can make schemas and just make things harder to use.

Abiteboul - Disagree

Bernstein - It's not just how you say things but how you learn - it must not be a manual activity to attach metadata.

Gerhard Weikum - takes over leading observations on DB, IR

business data is boring

                action is e-science, e-culture, and entertainment

absolute facts is a myth created by accountants

                uncertainty is fact, ambiguity is fact

hope for precise semantics based on universally agreed upon

                ontologies and perfect metadata

IR:

similarity search with ranking is the best approximation to semantic search

DB:

                can still leverage context (metadata, ontologies, multivariate distribution)

agree with Ullman - no such thing as pure semantics.

 

killer queries  where Google, dbms fails

Find gene expression data and regulatory paths related to Barrett tissue in the esophagus.

what are the most important results in percolation theory?

Are there any theorems isomorphic to my new conjecture?

Find information about public subsidies for plumbers

Where can I download an open source implementation of the ARIES recovery algorithm
 (needs to be decomposed into several pieces).

Which professors from D are teaching DBS and have research projects on XML

Who was president of the US when George Bush was born?

                  (can't do the decomposition and linking again)

"Who was the French woman that I met at the PC meeting where Peter Gray was PC chair?"

                                a) go through email archives and find which program committees I was on

                                b) then look to find the chairs of those committees

                                c) then having found that this was VLDB 95

get the list of the members and see that Sophie... came from Inria, Paris.

                                d) know that Paris is Paris, France.
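
[Editor's sketch of the four-step decomposition above, run over invented stand-ins for the email archive, the committee pages, and a gazetteer. The person names, the gender field, and the second meeting are hypothetical; only the VLDB 95 / Peter Gray / Inria, Paris details come from the example as stated.]

```python
# Step (a): committees found in my email archive (hypothetical list).
MY_PC_MEMBERSHIPS = ["VLDB 95", "ICDE 96"]

# Step (b): who chaired each committee.
PC_CHAIRS = {"VLDB 95": "Peter Gray", "ICDE 96": "Chair B"}

# Step (c): committee member lists; names are placeholders.
PC_MEMBERS = {
    "VLDB 95": [
        {"name": "Person A", "gender": "F", "affiliation": "Inria", "city": "Paris"},
        {"name": "Person B", "gender": "M", "affiliation": "Example U", "city": "Boston"},
    ],
    "ICDE 96": [
        {"name": "Person C", "gender": "F", "affiliation": "Example Lab", "city": "Rome"},
    ],
}

# Step (d): background knowledge that Paris is Paris, France.
CITY_COUNTRY = {"Paris": "France", "Boston": "USA", "Rome": "Italy"}


def find_person(chair, country, gender):
    """Chain the four lookups and return (name, meeting) pairs that match."""
    # (a) + (b): my committees that were chaired by `chair`
    meetings = [m for m in MY_PC_MEMBERSHIPS if PC_CHAIRS.get(m) == chair]
    matches = []
    for meeting in meetings:
        # (c): walk the member list; (d): resolve city to country
        for person in PC_MEMBERS.get(meeting, []):
            if (person["gender"] == gender
                    and CITY_COUNTRY.get(person["city"]) == country):
                matches.append((person["name"], meeting))
    return matches


print(find_person("Peter Gray", "France", "F"))
```

Each lookup is "dumb" on its own; the difficulty lies in extracting these tables from email and web pages and in knowing that they should be chained.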

Garcia-Molina - you are working in AI.

Weikum: Looks AI complete but you can do this with dumb things

Croft - Finding isomorphic theorems is the hardest one

Weikum - There is an "open math" project.

Croft - For question answering actually TREC does fairly well on that.   There are a lot of factoid questions and the current systems are finding 70% of the right answers in the top one/two. But these are not factoid questions. ARDA now sponsoring AQUAINT which looks at things like this in the intelligence domain.  They want to find authoritative docs.

Weikum - People expect to type a few words at Google and get the answer.  The goal should be to minimize human time - you learn how to rephrase the query.

Agrawal - Some queries will get money: "what websites accept Visa/Mastercard but not Amex" - Amex will pay for that; but for many queries people won't pay much for an answer.  We need to understand which queries have to be cheap and which can be expensive.

Weikum - Not sure money invested in the right things. 

Timos - What is missing from DB ?

Weikum -  Knowing which database to look in for the "gene expression data related to Barrett tissue" - there are many gene databases on the web – and each has its own schema.

Halevy - How much is understanding the query and how much is mapping it to formal SQL?

Silberschatz – Some, I see how to map into a DB and the one about math I can't.

Weikum - There is the open math activity - suppose you have high school math text books and we have codified them into logic. Some inferencing capabilities in that - you can then mimic this.  Pattern matching on XML.

Croft  - What are the drivers for integrating IR & DBMS?  You could build special purpose systems for each of your examples - or you could try to do this as IR.  But where do you have to unify the systems or make them communicate?

Mike Franklin - That’s the key question

Stonebraker - If you want metadata - e.g. super-duper UDDI - that's what we bring to the table.

Weikum: Shouldn't we formulate this as a meta-query - not SQL?

Halevy - The fundamental problem mixing the two worlds is that we have a subquery in some formal world and we go to a repository and all we have is text.  How do we come back with an answer to do joins?

Weikum - You could XML all the data you see on the web; but not sure which tags are important. Asked students to do researcher home pages and grossly underestimated the difficulty.  And it still doesn't handle ambiguity

IR strengths

  methodologically rich - statistics, probability, logic, NLP

  appreciation and experience with machine learning

  awareness of cognitive models for end-user intention and behavior

DB strengths

  integrity, scalability, availability, manageability

  system engineering

  resource optimization - caching, memory mgt, query opt, physical design,

  scheduling

Mike Franklin - Databases allow manipulation - update, summarize, aggregate. This is more than IR does. What about "find average salary of a US CEO" which will require computation.  Breaking this into the two queries is hard. Argument about whether it is easy to figure out that you need an average. The IR people can find a table but not do an average; the DB people are the reverse
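
[Editor's sketch of the split Franklin describes: an IR-style step that finds a relevant table, then a DB-style step that aggregates over it. The documents, salaries, and keyword-overlap scoring are invented for illustration; real retrieval and table extraction are far harder, which is the point of the argument.]

```python
from statistics import mean

# Hypothetical mini-corpus: each document has text and, optionally, a table.
DOCS = [
    {"text": "history of the steam engine", "table": None},
    {"text": "salary survey of US CEO compensation",
     "table": [
         {"ceo": "A", "country": "US", "salary": 1_000_000},
         {"ceo": "B", "country": "US", "salary": 3_000_000},
         {"ceo": "C", "country": "UK", "salary": 2_000_000},
     ]},
    {"text": "CEO interview about management style", "table": None},
]


def retrieve_table(keywords, docs):
    """IR step: rank by keyword overlap, return the best document's table."""
    def score(doc):
        return len(set(keywords) & set(doc["text"].lower().split()))

    candidates = [d for d in docs if d["table"] is not None and score(d) > 0]
    if not candidates:
        return None
    return max(candidates, key=score)["table"]


def average_salary(table, country):
    """DB step: filter rows by country and aggregate."""
    return mean(row["salary"] for row in table if row["country"] == country)


table = retrieve_table(["salary", "us", "ceo"], DOCS)
print(average_salary(table, "US"))
```

Note that the code is told to compute an average over the salary column; deciding that from the natural-language question is the step the participants disagreed about.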

Croft: IR and Google are not synonymous.

Stonebraker – It’s easy to express your query once you have a table; the hard part is putting together the table.

Hellerstein - You don't even know what the breakdown should be.  It is harder than you suggest.

Maier:   Human attention is scarce resource.  Where do you apply it? Writing metadata?  Google harnesses this a little bit.

Weikum - DB & IR: issues & non-issues

issues

  exploit collective human input

  use ML & ontologies

  flexible ranking to XQuery

  use ML to convert Web to XML

  extend Google to deep web

  break google monopoly

  acquire broader skills

non-issues - we can do these

  crawl structured data

  simple IR on XML

  polish XQuery and implement efficiently

  homepage.xml schema

Again we need probabilities.   As a special case we do traditional DB with result certainty 1.
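
[Editor's sketch of "traditional DB as the special case with certainty 1". The row data and the toy similarity predicate are assumptions. Each result carries a probability; an exact predicate returns only 0 or 1, so a classical selection falls out with every answer at certainty 1, while a similarity predicate yields a ranked list.]

```python
# Hypothetical rows, each paired with a prior certainty of 1.0.
ROWS = [
    ({"title": "XML query processing", "year": 2002}, 1.0),
    ({"title": "stream query systems", "year": 2002}, 1.0),
    ({"title": "recovery methods", "year": 1992}, 1.0),
]


def select(rows, predicate):
    """Probabilistic selection: predicate returns a match probability in [0, 1]."""
    out = []
    for row, p in rows:
        q = p * predicate(row)
        if q > 0:
            out.append((row, q))
    # Rank by certainty, highest first.
    return sorted(out, key=lambda pair: pair[1], reverse=True)


def exact(attr, value):
    """Classical predicate: probability is 1 on a match, else 0."""
    return lambda row: 1.0 if row[attr] == value else 0.0


def fuzzy_title(term):
    """Toy IR predicate: fraction of query words present in the title."""
    words = term.lower().split()

    def pred(row):
        title = set(row["title"].lower().split())
        return sum(w in title for w in words) / len(words)

    return pred


print(select(ROWS, exact("year", 2002)))        # all certainties are 1.0
print(select(ROWS, fuzzy_title("xml query")))   # ranked by partial match
```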

Google is popular because of ranking and coverage  

Ullman: No,  they were popular when they had less coverage.

Weikum - Afraid of Google having a monopoly.  Want to have a peering system that spreads out queries.

Mike Franklin - Purest merger of DB & IR is in annotated scientific databases and this problem is important today.  You need both DB & IR.

My session on infoglut - rather flat

 

Mike Franklin.   Info shadow is a problem.   I look for Canyon Creek development near me and it is buried under a lot of stuff about the same name in Texas

Gray - We need spatial search and also time – this pushes to a schematized metadata search – not just flat text.

Lesk - also proper names

Bernstein - Yahoo does categories.

Pazzani - Google had a student contest for new feature and the winner was geographic search.

Lesk - we need to think about video

Hellerstein - Sensors and sensor fusion generate lots of info there that is somewhat  structured.

Snodgrass - We need results on impossibility.  Which IR tasks aren't worth trying.

Lesk:  IR doesn't do much of that.

Croft - We try to categorize queries.  One TREC category is web search.  We are learning about queries and what we can do, which will work.

Gray - DB wants schematized search. IR doesn't.  Syntax has lost to statistical search in IR. Is  there a place where syntax works?

Lesk - OPACs - but Amazon seems to do better.

Gray - We should make the Web available as a study item for linguists, sociologists, etc.

Garcia-Molina – Our group at Stanford is doing that.    

Stonebraker - I have a 6th grade daughter - her teachers ask her to look up dinosaurs.  It is hard to find things appropriate - journal articles worthless.

Lesk – Search should do Flesch score and picture/text ratio.
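
[Editor's sketch of the Flesch idea. The formula is the standard Flesch Reading Ease score; the syllable counter is a crude vowel-run heuristic and is an assumption, as are the sample texts. Higher scores mean easier text, so a search aimed at a 6th grader could prefer high-scoring pages. The picture/text ratio is not sketched.]

```python
import re


def count_syllables(word):
    """Rough heuristic: count runs of vowels, with at least one per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


easy = "The cat sat. The dog ran. It was fun."
hard = "Paleontological investigations characterize sauropod biomechanics comprehensively."
print(flesch_reading_ease(easy) > flesch_reading_ease(hard))
```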

Gray - Spam - learn what is not interesting and what is interesting in user context.  Profile people

Croft - Contextual IR is an active area.

Move to what DB flavor in e-sciences

Hans - You said you wanted convergence - should ask about mergers

Bernstein - We focus on the query.  The IR stuff is on preprocessing - organizing; he says MeSH/UMLS got them from 50% to 80% performance - likes thesauri

Lesk - Disagree - schema helps very little. Described how the Internet Archive works. What could databases add?

Gray - It would run a lot faster – it is unusable now.

Dave Maier - Tobacco docs - comparing with open lit - query by a smoke chemist is "what  in these docs contradicts the journal literature?"  It is hard to do that.

Croft - That's info extraction

Lesk - see Futrelle doing that.

Weikum - What is the purpose of the query e.g. insider trading.   That's text and numbers.  They organize ahead of time. In your case you didn't know in advance how the documents would be used (about smoke chemistry.)

Hans - What services do the sides provide?  IR can do text categorization, DB can do engineering

Final vote show of hands: 3-1 for "I couldn't find it" over "it wasn't online"

Serge Abiteboul - XML  (prepared with Jennifer Widom)

  a boring research topic?

  a new frontier?

  a means to keep standards people busy?

XML

 rapidly adopted by industry

                format for exchange of small/medium pieces of data when archived grows to large volumes

                a data model - for a wide range of kinds of data.  not relational - permissive typing, full-text search

the database community should be involved and perhaps concerned

XML issues

                storage of XML

                native vs. XML-relational

                lesson from OODB - this is a business issue but the vendors are not trying to block it

                efficient representation and compression

                key issue is interface - not clear whether it should be like a DB - DOM, SAX - or a query language -- needs work

                revisiting old topics

                database design

                integrity constraints

                 concurrency control

                access control

reinventing the world

universal query language for XML

                problems with Xquery - promoted by W3C

                focus on complex queries need simple filters, IR style search

                too complex, ambitious, too much politics

                can you really go from documents to data

                people want to do what they did in SQL and others want doc search - this is hard

                can we undermine Xquery with something better?

                                thinks we need small core OQL plus plug ins

                                running late - we need standard now

This direction deactivated by XQuery

                scientific: is Xquery good or bad from a scientific viewpoint

                politics: should we push for it

Weikum: SQL can be segmented.

Stonebraker - You can't talk about Xquery without talking about schema.  That is what has to be subset.  Big tension between what the doc guys would like and XML.  XQuery does everything.  It makes IMS look simple.

Gray - We had identified Google with IR and now we are identifying Xquery and Xschema with everything.  We should think from a blank sheet of paper.  OODB did not fail;  it made object-relational possible.  An approach here is that the train has left the station.  We can't do much - there is an alternate path which is a much simpler query language.  You should pursue that if you have a better idea.

Query optimization

                for subsets of the language

                tree structure is a new ball game - new index structures, cost models, etc.

                 depends on storage

                revisit distributed query processing and view maintenance

everything being studied

 

Foundations

                lots of work on semi-structured data

                first-order logic and relational languages: strong

                OQL/functional languages: reasonable

                full-text search: messy

 typing

                 much more complex than in relational world

                 not settled

                query type checking, type inferencing, update consistency

                very active area - people from DB theory, functional programming, etc.

all this again is active, but problems not simple, need more work.

real frontier: world is changing

old vs. new data management

 

Old                     New

closed world       openness

client/server      P2P

distributed db     web-scale data

query/answer       subscription queries, stream queries

active db          active databases + web services, service discovery

QBE interface      new interfaces

 

research must focus on new issues - not single site data

beyond XML: the semantic web e.g. putting music on the internet was a very nice problem and the solution was elegant (Kazaa) even though the lawyers disagree - uses little traditional technology

 

Widom - When did you add semantic web?  I'm not responsible for that.

Abiteboul - All this is syntax.  Makes Ullman happy; the most fundamental difference from relational DB to web is that you don't know the semantics.

Ullman:  A high order bit for the report is "is querying XML too important to be left to W3C".

Stonebraker:  A simple thing to say is that Xquery is a pile of crap and XML schemas are a pile of crap and we can't influence that. If we had a clean sheet of paper and wanted to do something right, we would focus on merging doc world and structured data. No standards body can do this.

Widom:  People are implementing this; it's too late.

Lesk- So what is an XML success story?

Abiteboul - Newspaper articles - All were in separate formats.  Now they all use XML,  particularly NewsML.  Now we can merge 5 newspapers.  You have parsers and editors  and you can publish with very little effort.

Maier - The tools are very important.  I studied data interchange formats and found that people agree on what things mean, and without tools they weren't used.  Some things left behind like array data.

Gray:  Another plug for code+data; HTML started and people wanted to send script and when you send me XML I don't know what it is, just a bunch of tags, you have to send me the methods as well.

Abiteboul - Before methods you need metadata; then you provide code.  We should be more active - things like UDDI are dirty.  We should be helping here.

Gray - Dave Clark has a nice model for standards.  There is a period when it's too early for standards and a period when it's too late - research and production phases.  You need to be in between.  I do not think we are at the standards phase with our ideas yet. We still need more prototypes.

Abiteboul - We're working on Active XML - XML with embedded calls.

Stonebraker - You said we have to worry about views and updates – everything that came along with the relational model.  It will be more complex – this is what collapsed IMS.

Widom - You can write lots of papers. 

Stonebraker - You're too optimistic.

Abiteboul - We have a lot of models.  In a distributed session you probably will do some integration of things that are very relational -- integrating at the tuple level.

Stonebraker - Part of the IMS difficulties were restrictions on views.

Abiteboul  - OODB also had trees.

Ceri - If there's a lot of XML data out there we don't have the luxury of not dealing with it.  Even if hierarchical is the wrong way to start from scratch, we can't ignore it.

Gray - The IMS data model was designed by blue-collar programmers, no theory. Don't postulate that there is no good hierarchical data model because IMS failed 30 years ago.  Nobody has ever tried properly.

Bernstein - We can count on incremental forward progress.  All the relational products are making big investments in XML.  The data capture is inherently semi-structured.  e.g. there is always a "comment" field.

Widom - It would be absurd not to bless the area.

Bernstein - But people do think it is boring, the same areas as ten years ago.

Hellerstein - We should focus on more IR things with XML and here is a list of plausible real problems (the "new" in Abiteboul 's last slide).

Widom - We spent all morning moaning about structured data and IR.  This is a chance to do something about it.  The next language should be more IR-ish. What went wrong with Xquery?

Alon - Too many politics.

Lesk - Look at Dublin Core - every time they meet the standard gets thicker and heavier weight.

Maier - The manuals are too thick -  but SQL is no better.

Kersten -   Query sessions are missing from this discussion - not just one query in isolation.

Iannis - Database people know queries.  Actual users explore in unstructured ways and this often finds the most interesting things.  Queries are important;  but other things are too.  Context, personalized stuff, other modes of interaction.

Lesk - ranking, visualization

Hans - processes, flows, combinations of services .

Ceri -  I want similarity based browsing.

Snodgrass - We don't know if algebra is better than calculus.

Iannis – It is not an issue of calculus vs. algebra.   Declarative vs procedural is more important.   I did a study:   for simple stuff declarative is fine; for more complex stuff procedural is needed.  I don't know what kind of interface to give people.  But none of this has to do with XML.

Franklin: Semi-structured data is the big issue.

Maier – Why is there no XML on the web?   Are we doing anything to help with XML that is streaming?

Abiteboul  - Two questions a) not much public XML but lots in industry b) how do you handle changing data?

Hellerstein - If you take queries over streams and add distributed databases you get routing which is a big topic in the networking area.

Pazzani - In a startup XML is being used as an interchange language and then it gets dumped into relational DBs.  Also used as an intermediary for different screens, etc.  Not much going on in XML data bases.

Bernstein - Quite a lot going on.  Talk to vendors; our product people can list many big-time customers with lots of XML data.

DeWitt - Is it simple or complex?

Bernstein - They want to do queries. There is a wide range of tasks. We can't move fast enough.

Widom - There is no relational on the web either.  We don't ignore RDBMS.

Bernstein - Research on XML as a data model also has room for innovation.  Don't be negative about lines of traditional database research that can be applied to XML

Widom - The conferences are 1/3 XML now.  The problem is not that there is too little work.

Stonebraker - If you do research that competes with the vendors, that's not research.  A big problem we have is that a lot of what we do is too close-in.   Vendors will do this.  We should do something Oracle is not doing.

Widom – For example, query optimization for XML is not for researchers

Stonebraker - Yes.  Don't do that.  Leapfrog to the next data model. XML stinks; it is too complex.

Widom - XML and XML schema are different; the schemas are too complex.

Hellerstein - Our CS colleagues won't fund us to work on XML query optimization, but many other things would sound better.

Agrawal – This is not a firm statement but anecdotal info is that XML being stored right now is very simple.  A relational tuple or other simple structure.  The complexity of schemas that are coming is justified.

Widom – For example, an airline record has a few structured fields and the comment field;  that does not need all of XML.

DeWitt - We should take a stand.   We're going to get blamed for Xquery and Xschema.  People will say it came from the DB community.  Ullman said we should repudiate any association with Xquery.

Widom - We can't do that; we are already associated with it.

Croft - As an outsider reading DB papers I do blame you for Xquery.

Stonebraker - This is easy: we can say it is commercially important but we can do better.

Maier - What should we do as a data model if our goals are openness, peering, and so on?

Lesk:  Whatever you do put <> around it and call it xml.

Widom - Nobody has a beef with just XML

Abiteboul  - XML is just simple markup; then the schemas came and made the problem.

Widom - Why did everything get so complex?

Mike Carey - what is our purpose as a community? 

  1 - produce great new ideas: ie write off Xschema and forget it

  2 - structure the field (credits to Jim and Phil)

  3 - educate the workforce - Wisconsin produced students with experience building industrial strength software; claim Paradise better than DB2

Gray - Some of us - Dieter & me - work at companies with hundreds of PhDs who are doing the "how to make XML work" part.  The community is working in this area, but where should the research work, not advanced development, go?

Carey - If we focused entirely on research many of the Wisconsin things would not have happened.

DeWitt - Should we focus on Xquery optimization so you're educating the work force for the current jobs?

Gray - the academic community completely ignored SQL.  They said it was brain dead.  That was fine, it happened anyway.  I think we are in a similar state re XML- XSD-XQUERY today.

 

Stonebraker calls time.  ten minutes to lunch.

results of the poll on the gong show

    federated, heterogeneous              13

    querying the internet                      10

    personal db                                      8

    open source                                      5

    privacy                                              5

    visualization/new interfaces         5

    probabilistic                                    5

    autonomic                                         5

    db tools/cybertools                         4

    experiment management                4

So how adjust schedule:   Add querying the internet?

Hellerstein – No, we've done that.

Bernstein - We discussed visualization in 1989 - it never goes away.

Dave Maier –I would do experiment management.

Agreed to add that.

Stonebraker will do visualization, interfaces; frustrated that no one in this room is working on better UIs.

[Aside: Abiteboul is working with BnF on archiving the web; they are changing the law to get legal deposit on French (country) websites.]

Haas - Top 10 Reasons why Federated Can't Succeed and why it will anyway

Carey

Brief history of federation

  Multibase @1980.

  many attempts since - every few years with new model  

                functional, relational, object-oriented, logic-based, XML

  still not solved.  last night we all brought it up again

  will we ever solve it?

Haas

 top ten reasons against federation - I get whines about all of them

10. Robustness:  Systems fail, sources unavailable, more pieces mean more failures, so with robustness. (Objections: DeWitt - Google; Hellerstein - peer-to-peer; Stonebraker - your company is selling "sysplexes" which are single systems of things that can fail.  One piece of big iron will do better than 500 Linux systems - sort of anti-federating.)

9. Security: different systems have different security mechanisms, hard to have a coherent view of permissions; more points of failure, harder to make guarantee; and data is sometimes the "corporate jewels" and needs to be protected.  Schek - look at e-health: would you trust that to a federation?

8. Updates: recording change is not always an update.   Sources may not be databases; may have to go through an application API to do an update. ACIDity - not all data sources support ACID properties – transaction semantics not always possible, e.g. our current system doesn't support 2-phase commit.

7. Configurability: hard to set up; too many architectures possible; many choices, little guidance.  Lots of code to install and lots of connections to support.

6. Administration: hard to keep up; monitoring is hard; not all sources have tracking facilities; tuning is difficult; repairing is painful, need distributed debugging and you have to deal with different vendors.

5. Semantic Heterogeneity:      hard to identify commonalities - same terms, different meanings (but this is also a problem in a single system with the same data)

4. Insufficient metadata: all sources have different metadata with no uniform standard

3. Performance (data movement): need to move data, geographic distribution is common and the WAN is slow; large data volumes common and you can't just cache because changes can be frequent and hard to track, plus storage is not unlimited.

2. Performance (complexity): decision-support applns do complex queries and choices give big differences in performance.  Some sources may not have enough CPU power and you need expensive functions of data.

1. Performance (path length): simple queries - even OLTP-like - have huge overheads. Simple queries are common - easier to write, automatically produced. Should use one big query for performance, but that is not what gets written.

Mike Carey – Q: we have had these problems for 20 years so why will federated succeed? A:  It has to: integration is a top IT issue and not going away. Alternatives are expensive and/or painful: write it by hand with 10 different APIs; EAI/workflow solution; consolidation - warehouse, data marts.

Maier - How do you know about the data?

Bernstein - You do this in big  meetings.   Also simple scenarios exist - may not need high security or robustness for some applications.  Customers know the data; need is great and compromise is possible.

Progress being made - 20 years of distributed query processing.  Plumbing is in place; connectivity there.  Reliable messaging. XML is now sort of basic agreement on how to exchange data.  XML schema  is a way of describing data.  So we're getting closer.

What would we do if it worked?

   retire?  integrate the web - data google?  p2p database?

Is research warranted?  what are the most important topics?

Bernstein - The piece of this where we're making progress is semantics.

Maier - Look at blame allocation - be able to write down expectations of what the pieces should do and then be able to see what is happening.

Ullman - When you have enormous amounts of data you have to be uniform in your dealings.  You can't write code for every 100 bytes.  Once you have declarative languages you have to use query optimization.

Stonebraker - What Cohera found out, and you didn't mention, is that semantic heterogeneity nearly always involves dirty data - and cleaning data is better done in bulk.

Maier - In health data they want to get something going. Federated is easier and if that doesn't work fast enough they might try to put it in a warehouse.

Haas - We are doing a service integration system based on db2.

Croft - Does federation include resource discovery? Does it include schema?

Haas - Federation includes metadata - I didn't consider resource discovery separately.

Halevy - To feel better about what we have done we need to focus on who our customers are.  If people can put things in a warehouse they will do so. We need to go after the people who can't do this, who must put data in a warehouse.

Mike Franklin - Semantic heterogeneity not so bad.   Security is more serious.  They won't let people into their systems.

Carey - Sometimes all you have is a minimal interface

Stonebraker - You often have a non-relational interface which you have to wrap and then try to federate at a relational level - You might be better off at web level.

Garcia Molina - Why didn't anyone else vote for workflow;  Distributed workflow is similar.

Hellerstein - On the topic of reliability, there is lots of exciting work in networking.  You can find a key's value in a log number of links - p2p networks.  The db community doesn't talk to these people.

DeWitt - Distributed hash tables are not going to solve the world's problems.

Hans - You have underemphasized the problems of security and reliability. We can't live with low standards of accuracy - again see electronic patient record.

DeWitt - So what is the message? Laura says it’s impossible and Stonebraker says it’s done.

Lesk – The intelligence community tells me you only get a keyhole into db - they refuse to federate.

Agrawal - They want "need to know" information sharing - minimal information to be delivered.  We have a paper coming out.

Stonebraker - Two great success stories & one great failure.  (1) Airlines have been federating for years - very successfully.  When you have only half a dozen elephants and a huge incentive it works. (2) Both Dell & Wal-Mart have federated their supply chains.   One big enough elephant.  (3) RosettaNet - electronics community trying to federate their supply chain.  No big enough elephant and so it is not working. There is the same problem in autos.

Laura - Will work in specialized cases. we should solve some of these problems.

Hellerstein – Tools are good. We won't solve all of these - we need to deliver tools to content managers.

Ullman & Rakesh Agrawal: "Data Mining on Steroids"

Ullman – I am the only CS person who says in public favorable things about TIA. The DARPA John Poindexter & AI community project.  On 9/11 you had four guys with visible Al-Qaeda connections who went to 4 different flight schools with no connection to an airline.    If you could integrate all these records, you could have asked the right query.  This happens at two levels. a) How many al-Qaeda guys have been to flight schools? b) Even more ambitious - What strange things are going on?  But how rare was this?

 

This is an interesting problem; locality-sensitive hashing to focus on connections.   We need to find just a few events that are the most interesting.  The technology is not there yet but it is an interesting problem.

Gray - The license plate of the guys who were the Washington sniper was looked up 18 times in a few weeks. Nobody noticed this large number of lookups (and all were in the vicinity of one of the shootings) - because of different systems.

Ullman - You need Bayesian theory to tell you how unlikely something is.

Agrawal - Data Mining - Potentials and Challenges

observations

                some transfer of data mining research into products

                most in vertical applications

                horizontal tools - SAS Enterprise Miner, DB2 Intelligent Miner

                data mining in non-conventional domains

                new challenges because of security/privacy concerns

                DARPA initiative to fund data mining research

identifying social links using association rules

                crawled about 1M pages and found Arabic names and charted links to make a social network.  The most popular name was Al Gore - they blew the Arabic name identifier.

Hellerstein - Why not use a graph clustering algorithm?

Agrawal – We are using association rules.

Ullman - You need a strength measure.
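
A minimal sketch of the kind of strength measure being asked for, applied to name co-occurrence on crawled pages. Support, confidence, and lift are the standard association-rule measures; the function name `pair_rules`, the `min_support` default, and the input shape (one collection of names per page) are assumptions for illustration. The notes do not say which measures the actual system used.

```python
from collections import Counter
from itertools import combinations

def pair_rules(pages, min_support=0.2):
    # pages: iterable of name collections, one per crawled page (hypothetical input).
    # Returns {(a, b): (support, confidence, lift)} for each rule a -> b.
    pages = list(pages)
    n = len(pages)
    item_count = Counter()
    pair_count = Counter()
    for names in pages:
        unique = set(names)
        item_count.update(unique)
        pair_count.update(frozenset(p) for p in combinations(sorted(unique), 2))
    rules = {}
    for pair, count in pair_count.items():
        support = count / n                 # fraction of pages mentioning both names
        if support < min_support:
            continue
        for a in pair:
            (b,) = pair - {a}
            confidence = count / item_count[a]        # estimate of P(b | a)
            lift = confidence / (item_count[b] / n)   # above 1 means more than chance
            rules[(a, b)] = (support, confidence, lift)
    return rules
```

Lift is the useful part for the Al Gore problem above: a name that appears on a large share of all pages gets a high raw co-occurrence count with everyone, but its lift stays near 1.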

Agrawal - website profiling using classification.  training on labels like "Islamic leaders", etc.

Discovering trends using sequential patterns and shape queries - trends in patents, heat removal, emergency coolings, zirconium alloy, feed water. You look for a shape of the graph of % mentioned vs. year of those words. You sketch a "resurgence" in this case - V-shape - drop and then come back.
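
A crude sketch of a "resurgence" test over a yearly series of % mentions: the series must fall to a trough somewhere in the middle and then climb back. The function name and the two thresholds are assumptions; this stands in for, and is much simpler than, whatever shape-query language was actually used.

```python
def is_resurgence(series, min_drop=0.3, min_recovery=0.3):
    # series: % of documents mentioning a term, one value per year.
    # Drop and recovery are both measured relative to the first value.
    if len(series) < 3 or series[0] <= 0:
        return False
    trough = min(range(len(series)), key=series.__getitem__)
    if trough == 0 or trough == len(series) - 1:
        return False          # no V: the minimum sits at one end
    drop = (series[0] - series[trough]) / series[0]
    recovery = (series[-1] - series[trough]) / series[0]
    return drop >= min_drop and recovery >= min_recovery
```

Running this over every candidate phrase and keeping the ones that return True gives a list of terms whose popularity dropped and then came back.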

 

They are discovering microcommunities - tightly coupled bipartite graphs – e.g. Japanese elementary schools, Australian fire brigades, - you find tight graphs and then you manually label the areas.

 

new challenges 

                 privacy preserving data mining

                randomizing the data in a way that destroys individual data but not the summarizing stuff

                cryptographic approach

                privacy preserving discovery of association rules

                data mining over compartmentalized databases

                frequent traveler rating model - with demographics, credit ratings, criminal records, etc.  

TIA was going to build a giant warehouse and got flak; perhaps one could use randomized data shipping or local computation.
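
A small sketch of the randomization idea from the list above: each individual value is perturbed with zero-mean noise before it leaves its source, so single records are unreliable while aggregates survive. The synthetic ages, the noise range, and the helper names are all assumptions; published methods go further and reconstruct whole distributions, which this sketch does not attempt.

```python
import random

def randomize(values, spread, rng):
    # Add independent uniform noise in [-spread, spread] to every value.
    return [v + rng.uniform(-spread, spread) for v in values]

def mean(xs):
    return sum(xs) / len(xs)

rng = random.Random(0)                                # fixed seed for repeatability
ages = [rng.randint(20, 70) for _ in range(10_000)]   # synthetic data, not real records
noisy = randomize(ages, spread=20, rng=rng)

# Individual noisy values can be off by up to 20 years,
# but the two means should be close because the noise averages out.
print(round(mean(ages), 2), round(mean(noisy), 2))
```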

 

Croft - System to return a probability that it can return relevant data and then you go get permission.

Stonebraker - My discomfort is that in theory all warehouses are built for data mining but in fact nobody is doing any of it and the vendors are going broke.  The people I talked to were doing fairly simple things.  No statistical expertise on their staff.

Agrawal - Lots of leading companies are doing this.

Weikum? - The field is approaching saturation.  Interesting research but it is not for 10 years.  It's incremental.

Silberschatz:  If we solve TIA in 10 years  I would be surprised.

Ullman  - even if you give me everything in the world integrated I still can't ask the right question.  even more mundane - what is a gene.

Agrawal

some hard problems

                 past poor predictor of future

                 abrupt changes; wrong training examples

                actionable patterns

                how do we find what is surprising?

                over-fitting vs. not missing the rare nuggets

                how to ensure not overfitting - still hard

                richer patterns

                 in medical domain - you need dags

                 simultaneous mining over multiple data types

                text, voice and structured data

                when to use which algorithms

                avoid the "everything looks like a nail to a man with a hammer" trap

                automatic selection of algorithm parameters

CMU is now offering a degree in data mining (Tom Mitchell running program).

Pazzani - Management schools have been doing some of this for decades

Hellerstein - Many of us don't understand statistics - we should be  educating ourselves.  The undergraduates should be taught a bit more.

Gray - There is a popular book by Jiawei Han that is a nice intro and course.  The challenge is that SAS and other tools are chauffeur driven.  We have to make it easier.  The science community has a size problem.  Business has 1000s or 10ks of records or can subset and use quadratic or cubic algorithms.  Science users have very large datasets (billions). They need log-n or linear heuristics.  GenBank is about 40 GB right now - fairly small.

Hellerstein - We have an area that overlaps with statistical AI.  We need to talk about what we contribute.  people tell us our math skills are not up to the job.

 

Discussion

is datamining "rich" querying?  is it "deeply" integrated with database systems.  most current work makes little use of database functionality

  should analytics be an integral concern of database systems

  issues in datamining over heterogeneous data repositories.

 

Weikum: Should data mining be linked to data quality?  Biomed people very anxious about this. 

Agrawal - yes.

Pazzani:  DB community could teach machine learning about data that doesn't fit in main memory.  You must avoid things that take 10 passes over the data.
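
One illustration of the single-pass style being argued for: Welford's online algorithm computes mean and sample variance in one scan with constant memory, so the data never has to fit in main memory. The function name is an assumption; the algorithm is standard and is offered only as an example of the point, not as something discussed at the meeting.

```python
def one_pass_mean_variance(stream):
    # Welford's online algorithm: one pass, O(1) memory.
    n = 0
    mean = 0.0
    m2 = 0.0                      # running sum of squared deviations from the mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        raise ValueError("need at least two values")
    return mean, m2 / (n - 1)     # sample variance
```

Because it accepts any iterable, it can be fed a file reader or a database cursor directly, e.g. `one_pass_mean_variance(float(line) for line in open(path))` for a hypothetical `path` with one number per line.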

Snodgrass - Perhaps we should focus on summarization, visualization, then let people make deductions.

Ullman - I agree, this is one aspect but if all you have is visualization you need help.  Suppose you have 10-D data and you have to know which are the most interesting dimensions. 

Ceri  - What about semi-structured data?

Ullman - I've seen it but it's derivative. 

Abiteboul  - I've also seen it.

 Break

Stonebraker - this is boring, what to do?

Density of incrementalism to insight is high.

Gray & Lesk: Suggested tossing the agenda and asking if anyone was passionate about anything other than selling your own research.

Schek - We just have too few breaks; people want fresh air (1/2 the group had left after the break).

Maier - So what?  Should we plan the  wake for DB?

Gray - In previous meetings there has been conflict - relational vs OO; logic programming, XML.

Stonebraker: I'm happy to present a controversial vision statement. What's the purpose of this meeting? In previous cases there were research branches - right now I don't hear the controversy - we are all working away - not at a turning point.

Gray - Why are we here?  It's a 5 year interval - no specific agenda. It was not that the field is in crisis.  Last time we said text was going to be important but we have not done squat.

Schek - Other people did the work.

Ullman - I proposed 1 hr ago that the DB community should take charge of TIA.  Use a systems approach.  the spirit of TIA today is an AI spirit.  Describe a wonderful vision with no idea how to do it.  I'd rather work on version 1.0. 

Croft: - Enumerate research issues in TIA   

Ullman - Make clear it is a database rather than an AI issue.

Snodgrass - If you look at last reports they state 30-40 year goals at high level and of course we haven't reached it. 

Agrawal:    We should have some nearer term goals.  

Croft  - So what have we done in the last five years? (Xquery?)

Stonebraker -It looks to us like we're dead on our feet. 

Gray - I'm excited but it's applications, and I'm filling in gaps.

Stonebraker – OS people have quit doing that work - perhaps DB is a mature field and we should also drop things like query optimization. So I propose- we morph after dinner - 3 or 4 people to present visions of some sort that can't be achieved in ten years and listen to that.

Agrawal - One thing that would focus or excite us is some interesting application and TIA might be that thing.  It has database issues.

Gray - I have political problems with that. TIA has a big-brother overtone.

Stonebraker  - This evening anyone can get 15 minutes to say something that can't be accomplished in the next decade.  No restrictions other than that.

Ullman - I understand the political issues about TIA - but it needs to be done. Just as city dwellers 5,000 years ago needed walls around their cities.  It is a national need.  The government gives guns to 1.5M people and relies on them not to invade your home.  The political problem is to create analysts who get information and don't abuse things.

Stonebraker - This is a subset of heterogeneous federation and data mining.

Lesk: Three challenges e-science, TIA, personal memex [we've now killed 20 mins without getting anywhere]

Stonebraker: Integrating the deep web.

Gray: we have 24 hours left.  Is the field really stagnating?  Should we look for other careers?

Stonebraker - This discussion is very similar to the one 5 years ago.

Abiteboul  - In 1981 people told me databases were dead.

Gray - What has been discussed so far is incremental.  Oracle, IBM working in the mainstream.  What should the researchers be doing?

Abiteboul - But those guys don't publish so we need to do the same work.

Gray - They write a lot of papers.

Croft - Other areas are defining testbeds - so people could compare techniques. e.g. MT recently - was moribund and then defined a new measure 1.5 years ago and excitement is way up.  (overlap of ngrams).
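
A sketch of an n-gram overlap measure of the kind mentioned here: clipped n-gram precision of a candidate translation against one reference. The notes do not name the metric, so treating it as BLEU-style clipped precision is an assumption, as are the function names and the whitespace tokenization; a full metric would also combine several n-gram orders and penalize short candidates.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-token windows, in order.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_precision(candidate, reference, n):
    # Fraction of candidate n-grams that also occur in the reference,
    # counting each n-gram at most as often as the reference contains it.
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total
```

The clipping is what stops a degenerate output such as "the the the the" from scoring perfectly against any reference that contains "the".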

Ullman:  When you define a measure of progress people make it increase.

Croft - You have to come up with good measures

Maier - Alon was saying for semantic integration what if we found something for people to try - a corpus of 1000 large databases.

Garcia Molina - Why is it bad to have the same list as 5 years ago.  These are hard problems - should we only work on things we can solve in a year or two?

Bernstein: - It would be a problem if we had only the same solutions and were making no progress.

Gray - What progress have we made in the last seven years?  Lots of things in data mining, cubes, auto-tuning, materialized views. In 1996? 1976?  Don Slutz was sending queries to DB2, SQL systems - 90% of the time he got the answer and the rest of the time he got a crash.  Today you can use database systems and that is a result of research.  Research in QA, fixing query optimizers.

Garcia-Molina: Is Google an accomplishment of last five years?

Silberschatz: Do we teach Google in DB community? - general yes.  I have a lot of data on my desktop and I don't use any database tools to manage it.

Bernstein - People use Outlook to manage their contacts (1/3 of the room?)

Hellerstein - failure with Gong show is that we talked about other people's work.  (Laura had said this earlier).

Q: Should we just repeat the last report?  Say it was the right program.

Croft: how do we move ahead? A number of people said this was a really exciting time - so much data around and people care about it.

Lesk - Get people to do their own queries, just like IR.  That's what made it exciting.

Maier - We have a lot of people who were at Laguna.  Many of us are on our last research project.  I can't do something which is ten years out.  Maybe we have the wrong people.

Hellerstein - Disagree completely; wisdom has value.  Phil, e.g., can take risks at this stage in his career.

Garcia-Molina - The world is knocking on our door.  There is a threat from terrorists and are we going to say there is nothing to do?

Maier: Who's bored with their current work?  (only Ullman puts his hand up)  Carey and Halevy were the chairs of the two main conferences - what are the big issues?

Halevy - We had a lot of data mining papers and all but one were rejected.

Stonebraker  - I can summarize as "in the past there has been a sea change" and in 1997 it was the web.  Now we're just plodding along.

Gray - Web services are a sea change.  People can now publish info on the Internet, not just HTML.

Abiteboul  - Deep web.

Franklin - Instead of a gong show we go around and you get 30 seconds for what excites you.

Stonebraker - We will spend 1 hr after dinner giving 2 mins to each person to say what you're excited about or to present a grand challenge.

Stream processing - where's the Beef or Beer   Dave Maier and Stan Zdonik

applications

                real-time enterprise

                financial data feeds

                supply chain management

                sensors

                environmental monitoring

                RFID - radio frequency IDs - E-ZPass type - Gillette just ordered 500M at 10 cents each.

                network monitoring

the sensors are the things that have triggered the big interest

 

what are the issues?

quality of service?

what's wrong with existing technology?

 

issues

                push+latency: the data just comes but it ages fast

                dbms - system controls data flow and optimizes throughput

                sdms - sources control data delivery and you optimize latency

                update followed by query - not fast enough

                overload is possible - rate-based processing

DeWitt -  I see no evidence that optimizing for latency & throughput are different. If you take a standard DBMS and forget about persistence it's the same.

Gray - Standard systems have response time thresholds and try to answer as much as possible.  It is the same thing.

Croft - We also need different architectures to do 100K profiles against news wires.

Gray - In databases you treat queries as records and it works.

Maier - Is there always a duality like that - queries and data invert?
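
The "queries as records" idea in this exchange can be sketched as follows: standing profiles are stored and indexed like data, and each arriving news item is used to probe that index.  This is a minimal illustration only; the class name `ProfileIndex`, the restriction to all-terms-must-match keyword profiles, and the sample data are assumptions, not anything described at the meeting.

```python
from collections import defaultdict

class ProfileIndex:
    """Stores standing keyword profiles as records and indexes them by term.

    Hypothetical sketch: a profile matches a document when every one of
    its terms appears in the document.
    """

    def __init__(self):
        self.required = {}               # profile id -> number of distinct terms
        self.by_term = defaultdict(set)  # term -> ids of profiles containing it

    def add(self, profile_id, terms):
        terms = {t.lower() for t in terms}
        self.required[profile_id] = len(terms)
        for term in terms:
            self.by_term[term].add(profile_id)

    def match(self, document):
        """Return ids of profiles whose terms all appear in the document."""
        hits = defaultdict(int)
        for term in set(document.lower().split()):
            for profile_id in self.by_term.get(term, ()):
                hits[profile_id] += 1
        return sorted(p for p, n in hits.items() if n == self.required[p])

# Made-up profiles and news item.
index = ProfileIndex()
index.add("p1", ["oracle", "earnings"])
index.add("p2", ["sensor", "network"])
index.add("p3", ["earnings"])
matched = index.match("Oracle reports quarterly earnings")
```

The work per arriving item depends on the number of distinct terms it contains and the profiles that share them, not on the total number of stored profiles, which is why inverting the roles of queries and data is attractive when there are very many profiles.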

 

adaptivity

                loads change - so you cannot use a static plan

                adaptive optimization issues

                scheduling, load shedding, distributed bandwidth-aware optimization
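
As a rough illustration of load shedding under overload, the sketch below uses a bounded input buffer that discards the stalest tuples when arrivals outpace processing.  The drop-oldest policy, the class name `SheddingBuffer` and the numbers are assumptions chosen for simplicity; the notes do not say which shedding policy the speakers had in mind.

```python
from collections import deque

class SheddingBuffer:
    """Bounded input buffer that drops the oldest tuples when overloaded.

    Hypothetical sketch of one shedding policy (drop-oldest).
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.queue = deque()
        self.dropped = 0

    def push(self, item):
        if len(self.queue) >= self.capacity:
            self.queue.popleft()   # shed the stalest tuple to bound latency
            self.dropped += 1
        self.queue.append(item)

    def drain(self, budget):
        """Process at most `budget` tuples, oldest first, and return them."""
        out = []
        while self.queue and len(out) < budget:
            out.append(self.queue.popleft())
        return out

buf = SheddingBuffer(capacity=3)
for reading in range(5):       # burst of 5 arrivals before the operator runs
    buf.push(reading)
processed = buf.drain(budget=2)
```

Dropping the oldest tuples fits the earlier observation that stream data ages fast; a real system might instead sample, or shed according to the quality of service each query needs.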

correctness

                semantics may not be deterministic

                approximation, independent streams not synchronized

                transactions do not seem central

                update in place not the norm

                overlap of answer arrival with query processing

                mix queue-based processing with traditional storage

 

Silberschatz - At Lucent we worked on real-time billing - you append the call record in the database - and later you ask about the database.

Hellerstein - The only fun here is when you do distributed - push processing into the routers. 

Franklin - You are missing (a) the multiple query optimization problem, and you have to handle queries entering and leaving the system; (b) we have no agreement on what semantics you want - no agreement on time windows, etc.
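
The point about window semantics can be seen in a small sketch: the same event stream gives different answers under tumbling (non-overlapping) and sliding (overlapping) windows.  The function names, the window definitions and the timestamps are assumptions made for illustration, and the code assumes non-negative integer timestamps.

```python
def tumbling_counts(timestamps, width):
    """Count events in non-overlapping windows [k*width, (k+1)*width)."""
    counts = {}
    for t in timestamps:
        start = (t // width) * width
        counts[start] = counts.get(start, 0) + 1
    return counts

def sliding_counts(timestamps, width, slide):
    """Count events in overlapping windows [s, s+width), one every `slide`."""
    if not timestamps:
        return {}
    counts = {}
    start = 0
    while start <= max(timestamps):
        counts[start] = sum(1 for t in timestamps if start <= t < start + width)
        start += slide
    return counts

events = [1, 2, 9, 11, 12]     # arrival times in seconds (made-up data)
tumbling = tumbling_counts(events, width=10)
sliding = sliding_counts(events, width=10, slide=5)
```

Here the burst at times 9, 11 and 12 is split across two tumbling windows but appears as three events in the sliding window starting at 5, so a query such as "alert on 3 events within 10 seconds" behaves differently depending on which definition a system adopts.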