Raw notes by Mike Lesk
Hans Schek - new infrastructure for information space - this is thousands or millions of databases and computational services that do multimedia and text classification, term extraction, an architecture with lots of services and I call the new architecture "hyperdatabase" - databases can sit on all our devices. Analogy - a database is a platform for developing applications on shared data - hyperdatabase is a platform for developing services; instead of indexing we need feature extraction. Research area not just for us.
Mike Franklin - working on suite of projects about query processing in strange and interesting environments - Telegraph with Joe - stream processing how do this with lots of sharing, adaptive processing, and sensor networks - how push this out into the network - lots more interesting work to be done. Errors, lost messages, nature of sensing the environment. XML broker - how process large numbers of Xpath and Xquery queries - 10s of thousands or 100s of thousands. Also applying query processing techniques to the Grid. Make it more interactive and more easily programmable.
Bruce Croft - talk from recent ICDE conference - developing new probabilistic model of retrieval - applied to cross-language retrieval, image retrieval, and tighter integration with speech recognition, and MT. Just started working on pushing that to do retrieval in semi-structured IR domain. Can we provide IR API for semistructured database?
Joe Hellerstein - bringing data independence to networking - network is not just moving packets like post office - trying to write intelligent programs on top of a very volatile system - programs must be robust against nodes coming and going - we're doing sensor networks and peer to peer processing - convergence between graph algorithms and query optimization - we need adaptive algorithms - I enjoy both algorithms and building systems - having more fun than in several years.
Jeff Ullman - group should address TIA problem - linked discovery or chains of discovery in multiple databases - the technical problem excites me. Query optimization is not the same for streams as for traditional databases - the new stuff with XML is not just the same as SQL optimization.
Rick Snodgrass - methodological basis for our field - now hampered by our methodology - in science knowledge is encoded in theories - scientific theories are testable and make predictions - our basis is twofold (a) how about it, (b) if you need better performance we'll put something on. We test on a few data points. We don't have scientific models. We need a list of needed scientific models. I have 4 suggestions which I sent out in email - can't do this in a few seconds. Each model is predictive and testable.
Avi Silberschatz - many years ago I had a dream - my laptop was a database machine but with universal access to all data in the same way - some people in Stony Brook had the same goal - the database would sit below the operating system. Don't want to have to remember things like "lpq". All the data in the world will sit in some form of database with universal access to it. Lots of research issues - would be great to accomplish this.
Mike Carey - stopped doing research a few years ago - I'm in industry waiting for problems to come to me - working on XML - adding workflow to Xquery so it can do data transformations - now using XML schemas to think about your data and Xquery to express integration - want to integrate services and data.
Alon Halevy - my goal is to get people to stop complaining about semantic heterogeneity - want to automatically match between objects in different databases - experts use names and values - but over time they see lots of schemas and they get good at this. We're using a big corpus to learn things - e.g. typical attributes for a field named "student" and using patterns of this sort to match between different schemas and reformulate queries for a database we don't know anything about. This is part of our idea of how to do Google of 10K databases; we reformulate your query. More generally cross the "structure chasm" between IR and DB world - make it easier for people to author and query data.
Rakesh Agrawal - how can we make our data systems more privacy aware - can we design info systems which will be sensitive to privacy and data ownership but not impede flow of information. Two primary drivers - technology is too invasive in wrong hands - build in antidotes; and new business models that require cooperation plus national security requiring need-to-know for sharing; and the underlying technologies for these are likely to be similar.
Mike Pazzani - I'm here to listen - want to hear things that will increase the NSF budget in this area - so one mission is to ensure progress of science, and self-generated issues from the DB community might lead to 5% increases; this year it might be 15% - to the extent you find interesting publishable problems which help people in the geosciences or biosciences there may be double-digit increases - from $9M this year - take things like making it easier to get data into a database if it is not text - if it is a time series or images or chemical models - people have no idea how to do this and they're using undergraduates; toolkits to make this easy might lead to increased funding. Second, semantic heterogeneity is important - the Atkins report says some scientists spend 75% of their time moving data from one format to another, and if you helped with that the scientists would love you. I coordinate our relations with homeland security but TIA is not that different from finding all data about SARS or
Jennifer Widom - an extremely specific problem which will force us to think about some interesting things - keyword search over XML and have it give the right answer. (she was willing to stop there) IR-DB thing. a challenge to deal with semistructured data - deal with metadata - it all comes out if you do IR-like search over XML which has data/metadata mixed together and we don't know what the data really mean - if you go to XML and give keywords this also has to do with ranking, probabilistic reasoning, and it all comes together. (Schek some people in
Stefano Ceri - most of the web systems publish some information which comes from databases and publish some dynamic data. I work on principles which let you build some of these systems - I envisage a future where we could teach how to build web applications. up to our community.
Stan Zdonik - stream processing systems - addition of quality of service makes them different and unique. by having quality of service it becomes more difficult. Mike Franklin and I worked for a long time - used profiles - tried to understand user needs and application needs - if we want to move to autonomic systems we need to understand how workloads categorize different kinds of information and the QoS part of stream processing helps there.
Dave Maier - stream stuff is 1/3 of my time, 1/3 is putting your own structure on information, e.g. re-using attention, applications to personal data management, 1/3 is looking at data product management - from scientific data - observed and simulated data, and the fourth 1/3 is distributed query processing, data theory hybrids, trade latency for need to maintain distributed state, have to have some way to talk about catalog and coverage information, not just what kinds of data sources have but what coverage.
Dieter Gawlick - stream processing, started with conventional databases, 90% of our work is done. we have sophisticated use of information distribution so people can get a response when they are not around. we did some interesting stuff - expressionless data? - Demand analysis - you have something, who wants to have it. Oracle streams. all in products. as we go forward we have some ideas - I'm looking for business - as I do updates in a DB I get a stream. we don't start from events - we think of everything as a history - the sweet spot for a query is when something comes into the history - just what is on the brink of history. doing this demand analysis leads to a different model. I subscribe to a publication and it appears.
Gerhard Weikum - working on French woman query and others from this morning - working with DB technology plus machine learning, and ontologies are also an asset we would like to exploit.
Martin Kersten - battle between multimedia DB and database kernel guys; the media guys have thousands of hours of media and they run their jobs for hours - they need array based functionality. the database kernel people are battling the hardware - we can't get to the data fast enough - the CPU is idle 90% of the time so what can we throw out - we toss out hashing, random access, go to streamed processing - a new generation of DB kernels will be 10X faster.
Abiteboul - web is a large knowledge base - worked on that - but the web is not yet XML - We should contribute to turning the Web into a large knowledge base. Precise questions get precise answers and not a list of documents -- second problem is mixing XML docs with web services - exchange information that is static or dynamic - web services - super-excited - bringing XML with embedded service calls, active databases, query processing, tree structured data, and something very important is recursive query processing. to give you a flavor, when you use Kazaa to find records you are doing recursive query processing on the web - not efficient - beautiful technology, now we have a use for it.
Timos Sellis - formalize ETL by picking the right operators - if you have documents in thematic hierarchies - where does this doc sit, where you find it - ranking should take account of this - ontology information, semantics, technical part of the work is managing catalogs - building new thematic directories - extend search to use structural properties on top of keywords
Dave DeWitt - I used to do optimization for Xquery but I think that's impossible - at the end of my query I'm interested in a few hard core issues - what do we do with terabyte disk drives which make parallelism hard and bandwidth per drive is poor. we have to treat them as tape drives. optimization: queries are too complex, statistics are not good enough - look at what is in the buffer pool, what pages or tuples, push adaptive optimization, I'm hard core DB - the web stuff is interesting but not for me.
me - the Vannevar Bush dream and
Hector Garcia-Molina - you have lots of systems interacting but they are autonomous - why will they cooperate - what is the incentive to share or cooperate, even forward messages let alone queries or results. have to think about how systems get incentivized and why we should trust them. lots of interesting problems there.
Laura Haas - I don't really do research any more - very interested in large scale systems - integrate information from diverse systems - range from federated to big warehouses using ETL - space with caching and replication in the middle - interesting physical DB design issue - also crosses organizational boundaries - data placement problems - also need systems to be more dynamic - now they are statically configured and an expert sets them up - want to automatically map to different data sources - the whole grid thing - service interfaces - potential for us - how we can use the kinds of services that are provided by the grid-op-sys people - as they provide security services or accounting/billing our systems should open up and do things like that.
Mike Stonebraker - my grand vision in 1974 was System R. over the last 25 years we have added (a) code in databases - Postgres added code (b) spatial data, arrays and new data types - another add-on (c) active databases - glued on triggers, they are second class citizens, (d) text - object relational systems we put on text but there is no ranking or probabilistic stuff (e) queue management added but not enough for streams (f) parallel database (g) distributed databases especially web ones and heterogeneous and everyone is putting an O?DP wrapper around what's there. we took a 1974 idea and glued on all this crap - we should start with a clean sheet of paper then we would not get architectures that said DB2 has the data and BEA WebLogic has the code. that's a dumb distribution in a lot of ways. we have a gazillion of these things - they don't work together - rethink from a blank sheet - new query languages, interface, architecture. sad Ted Codd just died. new mandate needed to do this better.
Naughton - meta-level comment - if you do what is helpful for scientists it is of enormous benefit for non-scientists - look at web or computers. will happen again. every interesting technical challenge exists in the scientific arena - personal db, data streams, web databases, all in more manageable ways. and we can experiment on scientists more easily than on corporations. and funding agencies will support this. if we keep looking for motivation in problems that have already perceived commercial value industry will be doing it already.
Yannis Ioannidis - 2 things (a) if you want to buy a used car you can go on the web and bid; I see a need to buy data products - the value of a product may be a dollar price or in conjunction with how fast you get it or how complete the data is or how reliable - I want e-commerce techniques for query optimization - it's not just buying one thing; pieces of a query may run in different places; Mariposa+++ - multidimensional optimization using e-commerce techniques, lots of theoretical issues also. (b) personalization in databases - I should not get the same answer to a query as Ullman gets - much personalization in the Web but not in DB.
------------------------------------------------------------------------
[I think I heard a lot about streaming, about scientific data, about shared out architectures, and scattered other topics. few people are doing what they did for their PhD]
DeWitt - nobody said they were working on Xquery - (Franklin, Carey, are doing some). do you guys believe lots of stuff is going to be done?
Stonebraker - nobody mentioned concurrency control or access methods. we are all out on the periphery. Bernstein: there are people doing this. work in multiversion access methods.
Schek - there is some work on transactions in a more general sense.
Bernstein - dozen PhDs working on XML query optimization just not in research
Widom - people are doing optimization under the label "Xpath queries under streaming XML".
Hellerstein - few people needed storage.
Stonebraker -
a) Vote on the organizer's view of what the important problems are - you each get 3 votes on the topics. From yesterday's.
b) Tonight's session will be different - 1 hr on personal DB led by Gray & Weikum; 1 hr on "vision" - why did the last 4 reports have little influence - so we need this as a vision thing - just a brainstorming session. Want to end early today. Everybody is encouraged to stop when the discussion seems to have petered out, not just use all the time. The chairs will also do that.
Avi - who is audience for the report? Stonebraker - the audience is researchers picking research directions and also funding agencies. If we don't write something $1B goes to supercomputer centers and not us.
Hans - what did we present 15 years ago? Is that still in the vision?
DeWitt will do the diff.
What is IR?
70s-80s research focused on document retrieval
90s TREC reinforced the IR==document retrieval view
first bib records, then full text as time went on.
now doc retrieval is important - turned into web search
other topics
question answering - finding short segments with particular info
cross lingual retrieval
distributed retrieval - now big
topic detection and tracking
multimedia retrieval - images, video, annotating them - starting up now,
learning and labeling images and video with text.
summarization
IR & databases
differentiated by unstructured/structured data
what about marked up text and semi-structured data?
text has tended always to have at least a few fields
recent database papers on nearest neighbor and similarity search
distributed peer to peer search
Web search
info extraction
text data mining
boundaries getting fuzzier
IR integrated with databases
many such proposals - now in XML context - go back to 70s
e.g. combine ranked search and the specificity of user queries
supporting a probabilistic framework is the key
integration vs. cooperation: do we really want one giant system? or should
we still have separate systems & separate capabilities but they work together
semantic web - "if you made the web a database" - this is making the web into a knowledge base and that won't happen - we've had a debate for decades about manual vs. automatic representations of what documents "mean" and both together work better than either one, but creating the manual versions is very hard. That's the lesson from the IR work
go for knowledge or statistics?
Stonebraker - every
Gray - why haven't you mentioned KDD (Knowledge discovery)? The field is very fragmented. Every product has a text retrieval bolt-on to their database.
Croft - anyone who talks text data mining is similar to IR - that works well together. The data mining in structured data - numbers -
Hellerstein - it's all based on clustering, etc. Same as machine learning - there is a common set of technologies.
Croft - IR people like NL - want to understand how to describe and satisfy an information need in an unstructured world. That gets us excited. Yes, we built inverted file technology for large data but we focused on NL and the DB people have different needs.
Stonebraker - If I ask Google "what is the temperature in
Croft - We are working on that in the question answering world. You do want some context - you want more than just "73" as an answer (did it come from Bob's home page or where?) DB retrieval is fact retrieval so there is overlap. Some people work on extracting tables from text.
Stonebraker - this is similar to the first time I heard the discussion 20 years ago. The communities should cooperate and they don't.
Hellerstein - Not true! There has been a lot of overlap, now forced by the Web - the database community feels weak on text - and then we found that the IR stuff isn't that hard. Cohera and Whizbang are companies that had combined products. This is a healthy area.
Mike Franklin - How many people have been to SIGIR conferences? (few)
Mike Carey? We are organized into stacks. We should have a conference on a problem - not by community.
DeWitt- We could organize a conference on a topic. I like that idea.
(Martin Kersten?) - We don't need anything new - just join to attack a problem.
Croft - We do need something different than "you come to our conferences and vice versa".
Timos Sellis - A few more applications?
Croft - Want to have an NL query and not think about it.
Ullman - Re semantic web - you talk about semantics but when you have to do something you do syntax. If you take the temperature in
Croft - The history so far is that a focus on deep understanding and semantics has not produced benefits in effectiveness. Learning patterns has been useful - applied probabilistically - the little words don't help.
Gray - People use mostly nouns and verbs - they can throw away the rest - a telegraphic interface.
Ullman - Temperature is special because you can't crawl the web and get temperature in
Gerhard Weikum - NL is something that is not that great for queries - you need to understand the text that is there.
Ullman - Google works because it is simple.
Pazzani - The Google answer to "what is the temperature in
Serge Abiteboul - The problem is that if you have some info you can put it into plaintext and that's ridiculous. You have meta information and the question is when you have information if you start publishing meta-info it makes it much easier to avoid NL understanding.
Hellerstein - You can make schemas and just make things harder to use.
Abiteboul - Disagree
Bernstein - It's not just how you say things but how you learn - it must not be a manual activity to attach metadata.
Gerhard Weikum - takes over leading observations on DB, IR
business data is boring
action is e-science, e-culture, and entertainment
absolute facts is a myth created by accountants
uncertainty is fact, ambiguity is fact
hope for precise semantics based on universally agreed upon ontologies and perfect metadata
IR:
similarity search with ranking is the best approximation to semantic search
DB:
can still leverage context (metadata, ontologies, multivariate distribution)
agree with Ullman - no such thing as pure semantics.
killer queries where Google, dbms fails
Find gene expression data and regulatory paths related to Barrett tissue in the esophagus.
what are the most important results in percolation theory?
Are there any theorems isomorphic to my new conjecture?
Find information about public subsidies for plumbers
Where can I download an open source implementation of the ARIES recovery algorithm (needs to be decomposed into several pieces).
Which professors from D are teaching DBS and have research projects on XML
Who was president of the
(can't do the decomposition and linking again)
"Who was the French woman that I met at the PC meeting where Peter Gray was PC chair?"
a) go through email archives and find which program committees I was on
b) then look to find the chairs of those committees
c) then having found that this was VLDB 95 get the list of the members and see that Sophie... came from Inria, Paris.
d) know that
Garcia-Molina - you are working in AI.
Weikum: Looks AI complete but you can do this with dumb things
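The decomposition in steps a) to c) can be sketched as a chain of plain lookups and joins, in the spirit of "dumb things". This is a minimal Python sketch, not anything shown at the meeting: the in-memory tables, their field names, and every name and conference label in them are hypothetical placeholders for an email archive, conference records, and a people directory.

```python
# Hypothetical toy data; none of these records come from the notes.
EMAIL_ARCHIVE = [
    {"subject": "PC meeting agenda", "conference": "CONF-A"},
    {"subject": "PC meeting travel", "conference": "CONF-B"},
    {"subject": "Lunch on Friday", "conference": None},
]
PC_CHAIRS = {"CONF-A": "Chair One", "CONF-B": "Chair Two"}
PC_MEMBERS = {
    "CONF-A": ["Member X", "Member Y"],
    "CONF-B": ["Member Y", "Member Z"],
}
PEOPLE = {
    "Member X": {"gender": "F", "country": "France"},
    "Member Y": {"gender": "M", "country": "France"},
    "Member Z": {"gender": "F", "country": "Germany"},
}


def committees_from_email(archive):
    """Step a: find which program committees show up in my email."""
    return sorted({m["conference"] for m in archive if m["conference"]})


def committees_chaired_by(conferences, chair):
    """Step b: keep only the committees with the remembered chair."""
    return [c for c in conferences if PC_CHAIRS.get(c) == chair]


def members_matching(conferences, **attrs):
    """Step c: list committee members whose directory entry matches."""
    hits = []
    for conf in conferences:
        for name in PC_MEMBERS.get(conf, []):
            person = PEOPLE.get(name, {})
            if all(person.get(k) == v for k, v in attrs.items()):
                hits.append(name)
    return hits


def who_did_i_meet(chair, **attrs):
    """Chain the three steps into one answer list."""
    mine = committees_from_email(EMAIL_ARCHIVE)
    chaired = committees_chaired_by(mine, chair)
    return members_matching(chaired, **attrs)


answer = who_did_i_meet("Chair One", gender="F", country="France")
print(answer)
```

Each step is an ordinary selection or join. The hard parts raised in the discussion (getting these tables out of raw email and web pages, and resolving ambiguous names) are assumed away here.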
Croft - Finding isomorphic theorems is the hardest one
Weikum - There is an "open math" project.
Croft - For question answering actually TREC does fairly well on that. There are a lot of factoid questions and the current systems are finding 70% of the right answers in the top one/two. But these are not factoid questions. ARDA now sponsoring AQUAINT which looks at things like this in the intelligence domain. They want to find authoritative docs.
Weikum - People expect to type a few words at Google and get the answer. The goal should be to minimize human time - you learn how to rephrase query
Agrawal - Some queries will get money: "what websites accept Visa/Mastercard but not Amex" - Amex will pay for that; but for many queries people won't pay much for an answer; we need to understand which queries have to be cheap and which can be expensive.
Weikum - Not sure money invested in the right things.
Timos - What is missing from DB ?
Weikum - Knowing which database to look in for the "gene expression data related to Barrett tissue" - there are many gene databases on the web and each has its own schema.
Halevy - How much is understanding the query and how much is mapping it to formal SQL?
Silberschatz - Some I see how to map into a DB, and the one about math I can't.
Weikum - There is the open math activity - suppose you have high school math text books and we have codified them into logic. Some inferencing capabilities in that - you can then mimic this. Pattern matching on XML.
Croft - What are the drivers for integrating IR & DBMS? You could build special purpose systems for each of your examples - or you could try to do this as IR. But where do you have to unify the systems or make them communicate?
Mike Franklin - That's the key question.
Stonebraker - If you want metadata - e.g. super-duper UDDI - that's what we bring to the table.
Weikum: Shouldn't we formulate this as a meta-query - not SQL.
Halevy - The fundamental problem mixing the two worlds is that we have a subquery in some formal world and we go to a repository and all we have is text. How do we come back with an answer to do joins?
Weikum - You could XML all the data you see on the web; but not sure which tags are important. Asked students to do researcher home pages and grossly underestimated the difficulty. And it still doesn't handle ambiguity
IR strengths
methodologically rich - statistics, probability, logic, NLP
appreciation and experience with machine learning
awareness of cognitive models for end-user intention and behavior
DB strengths
integrity, scalability, availability, manageability
system engineering
resource optimization - caching, memory mgt, query opt, physical design, scheduling
Mike Franklin - Databases allow manipulation - update, summarize, aggregate. This is more than IR does. What about "find average salary of a
Croft: IR and Google are not synonymous.
Stonebraker - It's easy to express your query once you have a table; the hard part is putting together the table.
Hellerstein - You don't even know what the breakdown should be. It is harder than you suggest.
Maier: Human attention is a scarce resource. Where do you apply it? Writing metadata? Google harnesses this a little bit.
Weikum - DB & IR: issues & non-issues
issues
exploit collective human input
use ML & ontologies
flexible ranking to XQuery
use ML to convert Web to XML
extend Google to deep web
break google monopoly
acquire broader skills
non-issues - we can do these
crawl structured data
simple IR on XML
polish XQuery and implement efficiently
homepage.xml schema
Again we need probabilities. As a special case we do traditional DB with result certainty 1.
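One way to read "traditional DB with result certainty 1" is as a ranked select in which every predicate returns a probability and an exact-match predicate can only return 1 or 0. A minimal Python sketch under that reading follows. The function names, the toy rows, the term-overlap score, and the choice to multiply predicate scores (which assumes the predicates are independent) are illustrative assumptions, not a model proposed at the meeting.

```python
def exact(field, value):
    """Traditional DB predicate: the row matches with certainty 1 or 0."""
    return lambda row: 1.0 if row.get(field) == value else 0.0


def keyword_overlap(field, query):
    """Toy stand-in for an IR score: fraction of query terms in the text."""
    terms = query.lower().split()

    def score(row):
        words = set(str(row.get(field, "")).lower().split())
        if not terms:
            return 0.0
        return sum(t in words for t in terms) / len(terms)

    return score


def ranked_select(rows, predicates):
    """Return (probability, row) pairs, best first, dropping zero scores."""
    results = []
    for row in rows:
        p = 1.0
        for pred in predicates:
            p *= pred(row)  # assumes independent predicates
        if p > 0.0:
            results.append((p, row))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


# Hypothetical rows for illustration only.
rows = [
    {"id": 1, "dept": "CS", "bio": "works on xml query optimization"},
    {"id": 2, "dept": "CS", "bio": "stream query processing"},
    {"id": 3, "dept": "EE", "bio": "xml query engines"},
]

# Pure DB query: every surviving row has certainty 1.
only_exact = ranked_select(rows, [exact("dept", "CS")])

# Mixed DB + IR query: exact filter times a graded text score.
mixed = ranked_select(
    rows,
    [exact("dept", "CS"), keyword_overlap("bio", "xml query optimization")],
)
print([(round(p, 2), r["id"]) for p, r in mixed])
```

With only exact predicates the ranking degenerates to an unordered set of certain answers, which is the special case the slide mentions.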
Google is popular because of ranking and coverage
Ullman: No, they were popular when they had less coverage.
Weikum - Afraid of Google having a monopoly. Want to have a peering system that spreads out queries.
Mike Franklin - Purest merger of DB & IR is in annotated scientific databases and this problem is important today. You need both DB & IR.
Mike Franklin - Info shadow is a problem. I look for Canyon Creek development near me and it is buried under a lot of stuff about the same name in Texas.
Gray - We need spatial search and also time; this pushes to a schematized metadata search, not just flat text.
Lesk - also proper names
Bernstein - Yahoo does categories.
Pazzani - Google had a student contest for new feature and the winner was geographic search.
Lesk - we need to think about video
Hellerstein - Sensors and sensor fusion generate lots of info there that is somewhat structured.
Snodgrass - We need results on impossibility. Which IR tasks aren't worth trying.
Lesk: IR doesn't do much of that.
Croft - We try to categorize queries. One TREC category is web search. We are learning about queries and what we can do, which will work
Gray - DB wants schematized search. IR doesn't. Syntax has lost to statistical search in IR. Is there a place where syntax works?
Lesk - OPACs - but Amazon seems to do better.
Gray - We should make the Web available as a study item for linguists, sociologists, etc.
Garcia-Molina - Our group at Stanford is doing that.
Stonebraker - I have a 6th grade daughter - her teachers ask her to look up dinosaurs. It is hard to find things appropriate - journal articles worthless.
Lesk - Search should do Flesch score and picture/text ratio.
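A minimal Python sketch of what such a re-ranking might look like, using the standard Flesch Reading Ease formula (206.835 - 1.015 * words per sentence - 84.6 * syllables per word). Several things here are assumptions rather than anything from the discussion: the vowel-run syllable counter is a crude heuristic, "picture/text ratio" is taken to mean images per word, and the page records and the `rerank_for_kids` name are hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic: count runs of vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("need at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


def picture_text_ratio(num_images, text):
    """Assumed definition: images per word of body text."""
    words = re.findall(r"[A-Za-z]+", text)
    return num_images / max(1, len(words))


def rerank_for_kids(pages):
    """Order pages by reading ease first, then by picture/text ratio."""
    return sorted(
        pages,
        key=lambda p: (flesch_reading_ease(p["text"]),
                       picture_text_ratio(p["images"], p["text"])),
        reverse=True,
    )


# Hypothetical pages for illustration only.
pages = [
    {"title": "journal", "images": 0,
     "text": "Paleontological classification necessitates comprehensive "
             "stratigraphic interpretation."},
    {"title": "kids", "images": 3,
     "text": "Big dinosaurs ate plants. Some ate meat."},
]
ranked_titles = [p["title"] for p in rerank_for_kids(pages)]
print(ranked_titles)
```

A real system would combine these features with topical relevance rather than sort on them alone; the sketch only shows that both signals are cheap to compute at index time.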
Gray - Spam - learn what is not interesting and what is interesting in user context. Profile people
Croft - Contextual IR is an active area.
Move to what DB flavor in e-sciences
Hans - You said you wanted convergence - should ask about mergers
Bernstein - We focus on the query. The IR stuff is on preprocessing - organizing. He says MeSH/UMLS got them from 50% to 80% performance - likes thesauri
Lesk - Disagree - schema helps very little. Described how the Internet Archive works. What could databases add?
Gray - It would run a lot faster; it is unusable now.
Dave Maier - Tobacco docs - comparing with open lit - query by a smoke chemist is "what in these docs contradicts the journal literature?" It is hard to do that.
Croft - That's info extraction
Lesk - see Futrelle doing that.
Weikum - What is the purpose of the query e.g. insider trading. That's text and numbers. They organize ahead of time. In your case you didn't know in advance how the documents would be used (about smoke chemistry.)
Hans - What services do the sides provide? IR can do text categorization, DB can do engineering
Final vote show of hands: 3-1 for "I couldn't find it" over "it wasn't online"
a boring research topic?
a new frontier?
a means to keep standards people busy?
XML
rapidly adopted by industry
format for exchange of small/medium pieces of data; when archived, grows to large volumes
a data model - for a wide range of kinds of data. not relational - permissive typing, full-text search
the database community should be involved and perhaps concerned
XML issues
storage of XML
native vs. XML-relational
lesson from OODB - this is a business issue but the vendors are not trying to block it
efficient representation and compression
key issue is interface - not clear whether it should be like a DB - DOM, SAX, - or a query language -- needs work
revisiting old topics
database design
integrity constraints
concurrency control
access control
reinventing the world
universal query language for XML
problems with Xquery - promoted by W3C
focus on complex queries; need simple filters, IR-style search
too complex, ambitious, too much politics
can you really go from documents to data
people want to do what they did in SQL and others want doc search - this is hard
can we undermine Xquery with something better?
thinks we need small core OQL plus plug ins
running late - we need standard now
This direction deactivated by XQuery
scientific: is Xquery good or bad from a scientific viewpoint
politics: should we push for it
Weikum: SQL can be segmented.
Stonebraker - You can't talk about Xquery without talking about schema. That is what has to be subset. Big tension between what the doc guys would like and XML. XQuery does everything. It makes IMS look simple.
Gray - We had identified Google with IR and now we are identifying Xquery and Xschema with everything. We should think from a blank sheet of paper. OODB did not fail; it made object-relational possible. An approach here is that the train has left the station. We can't do much - there is an alternate path which is a much simpler query language. You should pursue that if you have a better idea.
Query optimization
for subsets of the language
tree structure is a new ball game - new index structures, cost models, etc.
depends on storage
revisit distributed query processing and view maintenance
everything being studied
Foundations
lots of work on semi-structured data
first-order logic and relational languages: strong
OQL/functional languages: reasonable
full-text search: messy
typing
much more complex than in relational world
not settled
query type checking, type inferencing, update consistency
very active area - people from DB theory, functional programming, etc.
all this again is active, but problems not simple, need more work.
real frontier: world is changing
old vs. new data management
Old               New
closed world      openness
client/server     P2P
distributed db    web-scale data
query/answer      subscription queries, stream queries
active db         active databases + web services, service discovery
QBE interface     new interfaces
research must focus on new issues - not single site data
beyond XML: the semantic web - e.g. putting music on the internet was a very nice problem and the solution was elegant (Kazaa) even though the lawyers disagree - uses little traditional technology
Widom - When did you add semantic web? I'm not responsible for that.
Abiteboul - All this is syntax. Makes Ullman happy; the most fundamental difference from relational DB to web is that you don't know the semantics.
Ullman: A high order bit for the report is "is querying XML too important to be left to W3C".
Stonebraker: A simple thing to say is that Xquery is a pile of crap and XML schemas are a pile of crap and we can't influence that. If we had a clean sheet of paper and wanted to do something right, we would focus on merging doc world and structured data. No standards body can do this.
Widom: People are implementing this; it's too late.
Lesk - So what is an XML success story?
Abiteboul - Newspaper articles - All were in separate formats. Now they all use XML, particularly NewsML. Now we can merge 5 newspapers. You have parsers and editors and you can publish with very little effort.
Maier - The tools are very important. I studied data interchange formats and found that people agreed on what things mean, but without tools the formats weren't used. Some things were left behind, like array data.
Gray: Another plug for code+data; HTML started and people wanted to send script and when you send me XML I don't know what it is, just a bunch of tags, you have to send me the methods as well.
Abiteboul - Before methods you need metadata; then you provide code. We should be more active - things like UDDI are dirty. We should be helping here.
Gray - Dave Clark has a nice model for standards. There is a period when it's too early for standards and a period when it's too late - research and production phases. You need to be in between. I do not think we are at the standards phase with our ideas yet. We still need more prototypes.
Abiteboul - We're working on Active XML - XML with embedded calls.
Stonebraker - You said we have to worry about views and updates, everything that came along with the relational model. It will be more complex; this is what collapsed IMS.
Widom - You can write lots of papers.
Stonebraker - You're too optimistic.
Abiteboul - We have a lot of models. In a distributed session you probably will do some integration of things that are very relational -- integrating at the tuple level.
Stonebraker - Part of the IMS difficulties were restrictions on views.
Abiteboul - OODB also had trees.
Ceri - If there's a lot of XML data out there we don't have the luxury of not dealing with it. Even if hierarchical is the wrong way when starting from scratch, we can't ignore it.
Gray - The IMS data model was designed by blue-collar programmers, no theory. Don't postulate that there is no good hierarchical data model because IMS failed 30 years ago. Nobody has ever tried properly.
Bernstein - We can count on incremental forward progress. All the relational products are making big investments in XML. The data capture is inherently semi-structured. e.g. there is always a "comment" field.
Widom - It would be absurd not to bless the area.
Bernstein - But people do think it is boring, the same areas as ten years ago.
Hellerstein - We should focus on more IR things with XML and here is a list of plausible real problems (the "new" in Abiteboul 's last slide).
Widom - We spent all morning moaning about structured data and IR. This is a chance to do something about it. The next language should be more IR-ish. What went wrong with Xquery?
Alon - Too many politics.
Lesk - Look at
Maier - The manuals are too thick - but SQL is no better.
Kersten - Query sessions are missing from this discussion - not just one query in isolation.
Iannis - Database people know queries. Actual users explore in unstructured ways and this often finds the most interesting things. Queries are important; but other things are too. Context, personalized stuff, other modes of interaction.
Lesk - ranking, visualization
Hans - processes, flows, combinations of services .
Ceri - I want similarity based browsing.
Snodgrass - We don't know if algebra is better than calculus.
Iannis - It is not an issue of calculus vs. algebra. Declarative vs. procedural is more important. I did a study: for simple stuff declarative is fine; for more complex stuff procedural is needed. I don't know what kind of interface to give people. But none of this has to do with XML.
Maier - Why is there no XML on the web? Are we doing anything to help with XML that is streaming?
Abiteboul - Two questions a) not much public XML but lots in industry b) how do you handle changing data?
Hellerstein - If you take queries over streams and add distributed databases you get routing which is a big topic in the networking area.
Pazzani - In a startup XML is being used as an interchange language and then it gets dumped into relational DBs. Also used as an intermediary for different screens, etc. Not much going on in XML data bases.
Bernstein - Quite a lot going on. Talk to vendors; our product people can list many big-time customers with lots of XML data.
DeWitt - Is it simple or complex?
Bernstein - They want to do queries. There is a wide range of tasks. We can't move fast enough.
Widom - There is no relational on the web either. We don't ignore RDBMS.
Bernstein - Research on XML as a data model also has room for innovation. Don't be negative about lines of traditional database research that can be applied to XML
Widom - The conferences are 1/3 XML now. The problem is not that there is not enough work.
Stonebraker - If you do research that competes with the vendors, that's not research. A big problem we have is that a lot of what we do is too close-in. Vendors will do this. We should do something Oracle is not doing.
Widom - For example, query optimization for XML is not for researchers.
Stonebraker - Yes. Don't do that. Leapfrog to the next data model. XML stinks; it is too complex.
Widom - XML and XML schema are different; the schemas are too complex.
Hellerstein - Our CS colleagues won't fund us to work on XML query optimization, but many other things would sound better.
Agrawal - This is not a firm statement but anecdotal info is that XML being stored right now is very simple. A relational tuple or other simple structure. The complexity of schemas that are coming is justified.
Widom - For example, an airline record has a few structured fields and the comment field; that does not need all of XML.
DeWitt - We should take a stand. We're going to get blamed for Xquery and Xschema. People will say it came from the DB community. Ullman said we should repudiate any association with Xquery.
Widom - We can't do that; we are already associated with it.
Croft - As an outsider reading DB papers I do blame you for Xquery.
Stonebraker - This is easy: we can say it is commercially important but we can do better.
Maier - What should we do as a data model if our goals are openness, peering, and so on?
Lesk: Whatever you do put <> around it and call it xml.
Widom - Nobody has a beef with just XML
Abiteboul - XML is just simple markup; then the schemas came and created the problem.
Widom - Why did everything get so complex?
Mike Carey - what is our purpose as a community?
1 - produce great new ideas: ie write off Xschema and forget it
2 - structure the field (credits to Jim and Phil)
3 - educate the workforce -
building industrial strength software claim Paradis better than DB2
Gray - Some of us - Dieter & me - work at companies with hundreds of PhDs who are doing the "how to make XML work" part. The community is working in this area, but where should the research work, not advanced development, go?
Carey - If we focused entirely on research many of the
DeWitt - Should we focus on Xquery optimization so you're educating the work force for the current jobs?
Gray - the academic community completely ignored SQL. They said it was brain dead. That was fine, it happened anyway. I think we are in a similar state re XML- XSD-XQUERY today.
Stonebraker calls time. ten minutes to lunch.
results of the poll on the gong show
federated, heterogeneous 13
querying the internet 10
personal db 8
open source 5
privacy 5
visualization/new interfaces 5
probabilistic 5
autonomic 5
db tools/cybertools 4
experiment management 4
So how adjust schedule: Add querying the internet?
Hellerstein No, we've done that.
Bernstein - We discussed visualization in 1989 - it never goes away.
Dave Maier I would do experiment management.
Agreed to add that.
Stonebraker will do visualization, interfaces; frustrated that no one in this room is working on better UIs.
[Aside: Abiteboul is working with BnF on archiving the web; they are changing the law to get legal deposit on French (country) websites.]
Carey
Brief history of federation
Multibase @1980.
many attempts since - every few years with new model
functional, relational, object-oriented, logic-based, XML
still not solved. last night we all brought it up again
will we ever solve it?
Haas
top ten reasons against federation - I get whines about all of them
10. Robustness: Systems fail, sources unavailable, more pieces mean more failures, so with robustness. (objections: DeWitt - google; Hellerstein - peer2peer; Stonebraker - your company is selling "sysplexes" which are a single system of things that can fail; One piece of big iron will do better than 500 linux systems - sort of anti-federating.)
9. Security: different systems have different security mechanisms, hard to have a coherent view of permissions; more points of failure, harder to make guarantee; and data is sometimes the "corporate jewels" and needs to be protected. Schek - look at e-health: would you trust that to a federation?
8. Updates: recording a change is not always an update. Sources may not be databases; may have to go through an application API to do an update. ACIDity - not all data sources support ACID properties; transaction semantics not always possible, e.g. our current system doesn't support 2-phase commit.
7. Configurability: hard to set up; too many architectures possible; many choices, little guidance. Lots of code to install and lots of connections to support.
6. Administration: hard to keep up; monitoring is hard; not all sources have tracking facilities; tuning is difficult; repairing is painful, need distributed debugging and you have to deal with different vendors.
5. Semantic Heterogeneity: hard to identify commonalities - same terms, different meanings (but this is also a problem in a single system with the same data)
4. Insufficient metadata: all sources have different metadata with no uniform standard
3. Performance (data movement): need to move data, geographic distribution is common and the WAN is slow; large data volumes common and you can't just cache because changes can be frequent and hard to track, plus storage is not unlimited.
2. Performance (complexity): decision-support applns do complex queries and choices give big differences in performance. Some sources may not have enough CPU power and you need expensive functions of data.
1. Performance (path length): simple queries - even OLTP-like - have huge overheads; simple queries are common - easier to write, automatically produced. Should use one big query for performance but it is not written that way.
Mike Carey Q: we have had these problems for 20 years so why will federated succeed? A: It has to: integration is a top IT issue and not going away. Alternatives are expensive and/or painful:
write it by hand with 10 different APIs
EAI/workflow solution
consolidation - warehouse, data marts
Maier - How do you know about the data?
Bernstein - You do this in big meetings. Also simple scenarios exist - may not need high security or robustness for some applications. Customers know the data; need is great and compromise is possible.
Progress being made - 20 years of distributed query processing. Plumbing is in place; connectivity there. Reliable messaging. XML is now sort of basic agreement on how to exchange data. XML schema is a way of describing data. So we're getting closer.
What would we do if it worked?
retire? integrate the web - data google? p2p database?
Is research warranted? what are the most important topics?
Bernstein - The piece of this where we're making progress is semantics.
Maier - Look at blame allocation - be able to write down expectations of what the pieces should do and then be able to see what is happening.
Ullman - When you have enormous amounts of data you have to be uniform in your dealings. You can't write code for every 100 bytes. Once you have declarative languages you have to use query optimization.
Stonebraker - What Cohera found out, and you didn't mention, is that semantic heterogeneity nearly always involves dirty data - and cleaning data is better done in bulk.
Maier - In health data they want to get something going. Federated is easier and if that doesn't work fast enough they might try to put it in a warehouse.
Haas - We are doing a service integration system based on db2.
Croft - Does federation include resource discovery? Does it include schema?
Haas - Federation includes metadata - I didn't consider resource discovery separately.
Halevy - To feel better about what we have done we need to focus on who our customers are. If people can put things in a warehouse they will do so. We need to go after the people who can't do this, who must put data in a warehouse.
Mike Franklin - Semantic heterogeneity not so bad. Security is more serious. They won't let people into their systems.
Carey - Sometimes all you have is a minimal interface
Stonebraker - You often have a non-relational interface which you have to wrap and then try to federate at a relational level - You might be better off at web level.
Garcia Molina - Why didn't anyone else vote for workflow; Distributed workflow is similar.
Hellerstein - On the topic of reliability, there is lots of exciting work in networking. You can find a key's value in a log number of links - p2p networks. The db community doesn't talk to these people.
DeWitt - Distributed hash tables are not going to solve the world's problems.
Hans - You have underemphasized the problems of security and reliability. We can't live with low standards of accuracy - again see electronic patient record.
DeWitt - So what is the message? Laura says it's impossible and Stonebraker says it's done.
Lesk - The intelligence community tells me you only get a keyhole into the db - they refuse to federate.
Agrawal - They want "need to know" information sharing - minimal information to be delivered. We have a paper coming out.
Stonebraker - Two great success stories & one great failure. (1) Airlines have been federating for years - very successfully. When you have only half a dozen elephants and a huge incentive it works. (2) Both Dell & Wal-Mart have federated their supply chains. One big enough elephant. (3) RosettaNet - electronics community trying to federate their supply chain. No big enough elephant and so it is not working. There is the same problem in autos.
Laura - Will work in specialized cases. we should solve some of these problems.
Hellerstein Tools are good. We won't solve all of these - we need to deliver tools to content managers.
Ullman - I am the only CS person who says in public favorable things about TIA. The DARPA John Poindexter & AI community project. On 9/11 you had four guys with visible Al-Qaeda connections who went to 4 different flight schools with no connection to an airline. If you could integrate all these records, you could have asked the right query. This happens at two levels. a) How many al-Qaeda guys have been to flight schools? b) Even more ambitious - What strange things are going on? But how rare was this?
This is an interesting problem; locality-sensitive hashing to focus on connections. We need to find just a few events that are the most interesting. The technology is not there yet but it is an interesting problem.
Gray - The license plate of the guys who were the Washington sniper was looked up 18 times in a few weeks. Nobody noticed this large number of lookups (and all were in the vicinity of one of the shootings) - because of different systems.
Ullman - You need Bayesian theory to tell you how unlikely something is.
Agrawal - Data Mining - Potentials and Challenges
observations
some transfer of data mining research into products
most in vertical applications
horizontal tools - SAS Enterprise Miner, DB2 Intelligent Miner
data mining in non-conventional domains
new challenges because of security/privacy concerns
DARPA initiative to fund data mining research
identifying social links using association rules
crawled about 1M pages and found Arabic names and charted links to make
a social network. the most popular name was Al Gore- they blew the
Arabic name identifier.
Hellerstein - Why not use a graph clustering algorithm?
Agrawal We are using association rules.
Ullman - You need a strength measure.
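One way to read the "strength measure" remark is through the standard support, confidence and lift figures for an association rule. The following is a minimal sketch of that idea, not the method the speakers described; the function name `pair_strengths` and the toy pages are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_strengths(pages, min_support=2):
    """Score co-occurring names with support, confidence and lift.

    pages: list of sets, each holding the names found on one page
    (hypothetical input; the notes do not describe the real data format).
    Returns {(a, b): (support_count, confidence_a_to_b, lift)}.
    """
    n = len(pages)
    single = Counter()  # pages on which each name appears
    pair = Counter()    # pages on which each pair of names appears together
    for names in pages:
        single.update(names)
        pair.update(combinations(sorted(names), 2))

    strengths = {}
    for (a, b), count in pair.items():
        if count < min_support:
            continue  # too rare to say anything about
        confidence = count / single[a]
        lift = (count / n) / ((single[a] / n) * (single[b] / n))
        strengths[(a, b)] = (count, confidence, lift)
    return strengths


# Toy data: four pages mentioning placeholder names A, B, C.
pages = [{"A", "B"}, {"A", "B"}, {"A", "C"}, {"C"}]
strengths = pair_strengths(pages)
```

A lift well above 1 suggests the two names co-occur more often than their individual frequencies would predict, which is one plausible way to rank links in the charted network.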
Agrawal - website profiling using classification. training on labels like "Islamic leaders", etc.
Discovering trends using sequential patterns and shape queries - trends in patents, heat removal, emergency coolings, zirconium alloy, feed water. You look for a shape of the graph of % mentioned vs. year of those words. You sketch a "resurgence" in this case - V-shape - drop and then come back.
They are discovering microcommunities - tightly coupled bipartite graphs e.g. Japanese elementary schools, Australian fire brigades, - you find tight graphs and then you manually label the areas.
new challenges
privacy preserving data mining
randomizing the data in a way that destroys individual data but not the summarizing stuff
cryptographic approach
privacy preserving discovery of association rules
data mining over compartmentalized databases
frequent traveler rating model - with demographics, credit ratings, criminal records, etc.
TIA
was going to build a giant warehouse and got flak
perhaps one could use randomized data shipping or local computation.
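The randomization item in the list above can be illustrated with additive noise: each released value is perturbed, yet an aggregate such as the mean remains estimable. This is only a sketch under assumed parameters (synthetic ages, uniform noise of width 30); it is not a reproduction of the technique presented in the talk.

```python
import random
import statistics


def randomize(values, spread, rng):
    """Add independent uniform noise in [-spread, +spread] to every value."""
    return [v + rng.uniform(-spread, spread) for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_ages = [rng.randint(20, 70) for _ in range(20000)]  # synthetic data
released = randomize(true_ages, spread=30, rng=rng)

# Individual released values are far from the originals, but because the
# noise has zero mean, the average of the released data tracks the true one.
true_mean = statistics.fmean(true_ages)
released_mean = statistics.fmean(released)
```

Recovering richer summaries, such as the full distribution, needs a reconstruction step that this sketch leaves out.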
Croft - System to return a probability that it can return relevant data and then you go get permission.
Stonebraker - My discomfort is that in theory all warehouses are built for data mining but in fact nobody is doing any of it and the vendors are going broke. The people I talked to were doing fairly simple things. No statistical expertise on their staff.
Agrawal - Lots of leading companies are doing this.
Weikum? - The field is approaching saturation. Interesting research but it is not for 10 years. It's incremental.
Silberschatz: If we solve TIA in 10 years I would be surprised.
Ullman - even if you give me everything in the world integrated I still can't ask the right question. even more mundane - what is a gene.
Agrawal
some hard problems
past poor predictor of future
abrupt changes; wrong training examples
actionable patterns
how do we find what is surprising?
over-fitting vs. not missing the rare nuggets
how to ensure not overfitting - still hard
richer patterns
in medical domain - you need dags
simultaneous mining over multiple data types
text voice and structure data
when to use which algorithms
avoid the "everything looks like a nail to a man with a hammer" problem
automatic selection of algorithm parameters
CMU is now offering a degree in data mining (Tom Mitchell running program).
Pazzani - Management schools have been doing some of this for decades
Hellerstein - Many of us don't understand statistics - we should be educating ourselves. The undergraduates should be taught a bit more.
Gray - There is a popular book by Jiawei Han that is a nice intro and course. The challenge is that SAS and other tools are chauffeur driven. We have to make it easier. The science community has a size problem. Business has 1000s or 10ks of records or can subset and use quadratic or cubic algorithms. Science users have very large datasets (billions). They need log-n or linear heuristics. GenBank is about 40 GB right now - fairly small.
Hellerstein - We have an area that overlaps with statistical AI. We need to talk about what we contribute. people tell us our math skills are not up to the job.
Discussion
is datamining "rich" querying? is it "deeply" integrated with database systems. most current work makes little use of database functionality
should analytics be an integral concern of database systems
issues in datamining over heterogeneous data repositories.
Weikum: Should data mining be linked to data quality? Biomed people very anxious about this.
Agrawal - yes.
Pazzani: DB community could teach machine learning about data that doesn't fit in main memory. You must avoid things that take 10 passes over the data.
Snodgrass - Perhaps we should focus on summarization, visualization, then let people make deductions.
Ullman - I agree, this is one aspect but if all you have is visualization you need help. Suppose you have 10-D data and you have to know which are the most interesting dimensions.
Ceri - What about semi-structured data?
Ullman - I've seen it but it's derivative.
Abiteboul - I've also seen it.
Stonebraker - this is boring, what to do?
Density of incrementalism to insight is high.
Gray & Lesk: Suggested tossing the agenda and asking if anyone was passionate about anything other than selling your own research.
Schek - We just have too few breaks; people want fresh air (1/2 the group had left after the break).
Maier - So what? Should we plan the wake for DB?
Gray - In previous meetings there has been conflict - relational vs OO; logic programming, XML.
Stonebraker: I'm happy to present a controversial vision statement. What's the purpose of this meeting? In previous cases there were research branches - right now I don't hear the controversy - we are all working away - not at a turning point.
Gray - Why are we here? It's a 5 year interval - no specific agenda. It was not that the field is in crisis. Last time we said text was going to be important but we have not done squat.
Schek - Other people did the work.
Ullman - I proposed 1 hr ago that the DB community should take charge of TIA. Use a systems approach. the spirit of TIA today is an AI spirit. Describe a wonderful vision with no idea how to do it. I'd rather work on version 1.0.
Croft: - Enumerate research issues in TIA
Ullman - Make clear it is a database rather than an AI issue.
Snodgrass - If you look at last reports they state 30-40 year goals at high level and of course we haven't reached it.
Agrawal: We should have some nearer term goals.
Croft - So what have we done in the last five years? (Xquery?)
Stonebraker - It looks to us like we're dead on our feet.
Gray - I'm excited but it's applications, and I'm filling in gaps.
Stonebraker - OS people have quit doing that work - perhaps DB is a mature field and we should also drop things like query optimization. So I propose we morph after dinner: 3 or 4 people present visions of some sort that can't be achieved in ten years, and we listen to that.
Agrawal - One thing that would focus or excite us is some interesting application, and TIA might be that thing. It has database issues.
Gray - I have political problems with that. TIA has a big-brother overtone.
Stonebraker - This evening anyone can get 15 minutes to say something that can't be accomplished in the next decade. No restrictions other than that.
Ullman - I understand the political issues about TIA - but it needs to be done. Just as city dwellers 5,000 years ago needed walls around their cities. It is a national need. The government gives guns to 1.5M people and relies on them not to invade your home. The political problem is to create analysts who get information and don't abuse things.
Stonebraker - This is a subset of heterogeneous federation and data mining.
Lesk: Three challenges e-science, TIA, personal memex [we've now killed 20 mins without getting anywhere]
Stonebraker: Integrating the deep web.
Gray: we have 24 hours left. Is the field really stagnating? Should we look for other careers?
Stonebraker - This discussion is very similar to the one 5 years ago.
Abiteboul - In 1981 people told me databases were dead.
Gray - What has been discussed so far is incremental. Oracle, IBM working in the mainstream. What should the researchers be doing?
Abiteboul - But those guys don't publish so we need to do the same work.
Gray - They write a lot of papers.
Croft - Other areas are defining testbeds - so people could compare techniques. e.g. MT recently - was moribund and then defined a new measure 1.5 years ago and excitement is way up. (overlap of ngrams).
Ullman: When you define a measure of progress people make it increase.
Croft - You have to come up with good measures
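The notes do not name the MT measure, only that it is based on overlap of n-grams. As a hedged illustration of what such an overlap count looks like, here is a generic clipped n-gram precision; the function names and example sentences are invented for this sketch and it is not the full measure Croft refers to.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference (clipping), so repetition is not rewarded.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


candidate = "the cat sat on the mat"
reference = "the cat is on the mat"
unigram_score = ngram_precision(candidate, reference, 1)
bigram_score = ngram_precision(candidate, reference, 2)
```

A shared, cheap-to-compute score of this kind is what lets different groups compare techniques on the same testbed.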
Maier - Alon was saying for semantic integration what if we found something for people to try - a corpus of 1000 large databases.
Garcia Molina - Why is it bad to have the same list as 5 years ago. These are hard problems - should we only work on things we can solve in a year or two?
Bernstein: - It would be a problem if we had only the same solutions and were making no progress.
Gray - What progress have we made in the last seven years? Lots of things in data mining, cubes, auto-tuning, materialized views. In 1996?1976? Don Slutz was sending queries to DB2, SQL systems - 90% of the time he got the answer and the rest of the time he got a crash. Today you can use database systems and that is a result of research. Research in QA, fixing query optimizers.
Garcia-Molina: Is Google an accomplishment of last five years?
Silberschatz: Do we teach Google in the DB community? (General yes.) I have a lot of data on my desktop and I don't use any database tools to manage it.
Bernstein - People use Outlook to manage their contacts (1/3 of the room?)
Hellerstein - failure with Gong show is that we talked about other people's work. (Laura had said this earlier).
Q: Should we just repeat the last report? Say it was the right program.
Croft: how do we move ahead? A number of people said this was a really exciting time - so much data around and people care about it.
Lesk: - Get people to do their own queries. just like IR. that's what made it exciting.
Maier - We have a lot of people who were at Laguna. Many of us are on our last research project. I can't do something which is ten years out. Maybe we have the wrong people.
Hellerstein - Disagree completely; wisdom has value. Phil can e.g. take risks at his stage in the career.
Garcia-Molina - The world is knocking on our door. There is a threat from terrorists and are we going to say there is nothing to do.
Maier: Who's bored with their current work? (only Ullman puts his hand up) Carey and Halevy were the chairs of the two main conferences - What are the big issues?
Halevy - We had a lot of data mining papers and all but one were rejected.
Stonebraker - I can summarize as "in the past there has been a sea change" and in 1997 it was the web. Now we're just plodding along.
Gray - Webservices are a sea change. People can now publish info on the Internet, not just html.
Abiteboul - Deep web.
Franklin - Instead of a gong show we go around and you get 30 seconds for what excites you.
Stonebraker - we will spend 1 hr after dinner giving 2 mins to each person to say what you're excited about or to present a grand challenge.
applications
real-time enterprise
financial data feeds
supply chain management
sensors
environmental monitoring
RFID - radio frequency ids - E-ZPass type - Gillette just ordered 500M at 10 cents each.
Network monitoring
the sensors are the things that have triggered the big interest
what are the issues?
quality of service?
what's wrong with existing technology?
issues
push+latency: the data just comes but it ages fast
dbms - system controls data flow and optimizes throughput
sdms - sources control data delivery and you optimize latency
update followed by query - not fast enough
overload is possible - rate-based processing
DeWitt - I see no evidence that optimizing for latency & throughput are different. If you take a standard DBMS and forget about persistence it's the same.
Gray - Standard systems have response time thresholds and try to answer as much as possible. It is the same thing.
Croft - We also need different architectures to do 100K profiles against news wires.
Gray - In databases you treat queries as records and it works.
Maier - Is there always duality like that - queries and data invert?
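The "queries as records" idea in the exchange above can be sketched as follows: standing profiles are stored and indexed like data, and each arriving document is used as the probe. This is a minimal illustration assuming profiles are conjunctive keyword sets; the class `ProfileIndex` and the sample profiles are hypothetical.

```python
from collections import defaultdict


class ProfileIndex:
    """Stores standing keyword profiles and matches documents against them."""

    def __init__(self):
        self.term_to_profiles = defaultdict(set)  # term -> ids of profiles using it
        self.profile_size = {}                    # profile id -> number of terms

    def add(self, profile_id, terms):
        """Register a profile that fires when ALL of its terms appear."""
        terms = {t.lower() for t in terms}
        self.profile_size[profile_id] = len(terms)
        for term in terms:
            self.term_to_profiles[term].add(profile_id)

    def match(self, document):
        """Return the sorted ids of every profile satisfied by the document."""
        hits = defaultdict(int)
        for term in set(document.lower().split()):
            for pid in self.term_to_profiles.get(term, ()):
                hits[pid] += 1
        return sorted(p for p, n in hits.items() if n == self.profile_size[p])


index = ProfileIndex()
index.add("p1", ["oil", "prices"])
index.add("p2", ["election"])
index.add("p3", ["oil", "spill"])
matched = index.match("Oil prices rise after election")
```

The work per document depends on how many profiles share its terms rather than on the total number of profiles, which is why the inversion is attractive for very large profile sets.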
adaptivity
loads change - so cannot do a static plan
adaptive optimization issues
scheduling, load shedding, distributed bandwidth-aware optimization
correctness
semantics may not be deterministic
approximation, independent streams not synchronized
transactions do not seem central
update in place not the norm
overlap of answer arrival with query processing
mix queue-based processing with traditional storage
Silberschatz - At Lucent we worked on real-time billing - you append the call record in the database - and later you ask about the database.
Hellerstein - The only fun here is when you do distributed - push processing into the routers.