Working with Alex Szalay of Johns Hopkins
Working on an ACM subcommittee, chaired by Joseph Halpern of Cornell, to put all CS research articles on the web via the Computing Research Repository (CoRR, http://arxiv.org/corr/home); I also moderate CoRR's Database Section.
Working with David Lipman, Jim Ostell, and others at NCBI on portable PubMed Central – a version of PubMed Central that can be deployed (and federated) internationally. This is part of a larger effort to get all science literature online.
Microsoft sponsors the following university research efforts (and I monitor the grants):
Infrastructure and Education
2006 BARC External Research Fund Grants
Our research group (BARC – Scalable Servers) has a small (225k$) external research budget. Our focus is data management ("information at your fingertips") and scalable servers, and our funding reflects that bias. We allocate these funds through a process that is
· low-volume (at most 35k$ to about 8 grants),
· low overhead (one paragraph proposal from researcher),
· unrestricted gift for public domain research on the designated topic (no other strings attached).
The allocation is based on:
· reviews and rankings by an ad hoc email ballot of senior technical Microsoft staff (both research and product), based on reading the one-paragraph proposals, following some links, and knowledge of the investigators' accomplishments.
Johannes Gehrke, Cornell University, http://www.cs.cornell.edu/johannes
Data Privacy for Medical Data
Medical data about individuals is an invaluable resource for analyzing medical treatments, performing epidemiological studies, and developing models of diseases in the population. This general application area has already received considerable attention from the research community, and has been a motivation for our own previous work.
The general goal is to publish information about a population of patients without permitting identification of individuals or leaking private data about them.
We plan to continue our collaboration with scientists from Weill Medical College on privacy for medical data. There are many open research challenges, and we plan to concentrate on the following three. First, the publisher of the data does not know what background knowledge an attacker may have about the entities in the data and their relationships. We have some preliminary results from our work on l-diversity from last year's funding, but there is not yet a good, general way to quantify the power of an attacker and thus the risk of breaking an anonymization. Second, the publisher has to strike a delicate balance between the privacy and the utility of the published dataset. Third, medical data often contains complex relationships between entities that need to be preserved as much as possible, since these relationships are the targets of analyses. Besides its conceptual contributions, our work will produce software usable by the domain scientists, which we will also make available online.
FY06 progress: Data Mining and Data Privacy for Medical Information Systems
Over the last year, we have made significant progress on foundational issues in data privacy. First, we developed a new notion of anonymity called l-diversity, which closes several subtle but potentially powerful attacks on k-anonymity, the state-of-the-art method for anonymizing datasets. Although only published at ICDE 2006, l-diversity has already become a standard method for data anonymization, with citations in SIGMOD 2006, KDD 2006, and VLDB 2006. Second, we took first steps toward a general understanding of the utility of an anonymized dataset; our method quantifies utility by making a novel connection between graphical models, log-linear models, and anonymized data. Third, we investigated practical methods for restricting information leakage through database views when secret information is defined by a database query. Our work resulted in papers at ICDE 2006, SIGMOD 2006, and PODS 2006, tutorials at ICDE 2006 and KDD 2006, and anonymization software that is publicly available for download.
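The l-diversity principle can be illustrated with a small check. This is not the group's released software, just a minimal sketch of the simplest ("distinct") variant: every group of records sharing the same quasi-identifier values must contain at least l distinct sensitive values. The table below is fabricated.

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_ids, sensitive, l):
    """Check distinct l-diversity: every equivalence class (group of rows
    sharing the same quasi-identifier values) must contain at least l
    distinct values of the sensitive attribute."""
    classes = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in quasi_ids)
        classes[key].add(row[sensitive])
    return all(len(vals) >= l for vals in classes.values())

# A 2-anonymous table that is NOT 2-diverse: both rows in the
# (zip="130**", age="<30") class share the same disease, so an attacker
# who can place a person in that class learns their diagnosis.
table = [
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "148**", "age": ">=30", "disease": "flu"},
    {"zip": "148**", "age": ">=30", "disease": "cancer"},
]
print(is_l_diverse(table, ["zip", "age"], "disease", 2))  # False
```

This distinct-value check is the weakest member of the l-diversity family; the paper's entropy-based and recursive variants strengthen it, but the grouping step is the same.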
Comparative Analysis of RNA Sequences, Structures, & Evolution using a Database
Our understanding of the structure, function, and evolution of RNA molecules has increased significantly with the recent improvements in sequencing and crystallography. The accuracy and detail that can be deciphered from RNA data is directly proportional to the creativity, design, and speed of the analysis, as well as the organization and accessibility of the data.
Over the past 25 years, the Gutell lab has developed successful sequence and structure analysis techniques and associated data management systems, producing an accurate prediction of Ribosomal RNA secondary structure and the identification and characterization of new structural motifs. However, the amount of sequence and structure data has increased by orders of magnitude, and the objectives of our analysis are much more ambitious and sophisticated.
Given this tremendous increase in data, we are developing a new computer system on SQL Server 2005 that will integrate more than 500,000 sequences (ranging from 100 to over 5,000 nucleotides in length), and the associated structural and phylogenetic information, into a database hosting analysis and ad hoc data-retrieval tasks.
Our goal is to perform the majority of the analysis within SQL Server, including sequence alignment, covariation analysis, and other statistical and phylogenetic queries, using T-SQL and CLR stored procedures. Preliminary work has demonstrated that this approach is viable. Next steps include:
· implementing diverse analyses in CLR stored procedures
· addressing scaling issues on clusters for large tasks
· developing web services for direct programmatic access
· developing a graphical user interface to access, manipulate, and analyze data.
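Covariation analysis, one of the in-database analyses listed above, scores pairs of alignment columns by how strongly they vary together, since positions that form a base pair tend to mutate in concert. The sketch below illustrates the standard mutual-information statistic in Python on a fabricated toy alignment; the project itself would express this in T-SQL and CLR stored procedures inside SQL Server.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Covariation score between alignment columns i and j:
    MI(i,j) = sum over (x,y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
    High MI suggests the two positions co-vary, e.g. to preserve a base pair."""
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    pairs = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (x, y), c in pairs.items():
        pxy = c / n
        # pxy / (p(x) * p(y)) written with counts: pxy * n * n / (c_x * c_y)
        mi += pxy * math.log2(pxy * n * n / (col_i[x] * col_j[y]))
    return mi

# Toy alignment: columns 0 and 3 always change together (as in a base pair),
# so MI(0,3) is high while MI(0,1) is zero.
aln = ["GCCC", "GCAC", "CCCG", "CCAG"]
print(mutual_information(aln, 0, 3))  # 1.0 (bits)
print(mutual_information(aln, 0, 1))  # 0.0
```

Running the same aggregation over all column pairs of a 500,000-sequence alignment is exactly the kind of grouped counting that a relational engine handles well, which motivates doing the analysis inside the server.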
Securing e-Science Portals
E-scientists often settle for security-through-obscurity in their Internet-based collaborations, largely because existing tools for specifying security mechanisms and policies are geared toward IT professionals, not domain scientists, who are unlikely to have deep knowledge of computer security. The increasing use of portals as an all-in-one approach to collaborative scientific exploration and discovery exacerbates this vulnerability: an attacker can now access a wide range of important data via a single compromise of the portal frontend. Stronger emphasis should be placed on e-scientists leveraging the security infrastructures they already rely on and trust (such as authentication sources and authorization servers), resulting in a federated approach.
In this project, we create tools with which domain scientists can collaborate securely by reusing, as much as possible, their existing authentication and authorization policies and mechanisms. We focus on Microsoft Office SharePoint Server 2007 as a portal for scientific collaboration, drawing our specific requirements from the emerging MSR-based Ameriflux portal (based on the Ameriflux web pages at ORNL). In the first phase, we study the security requirements of the existing ORNL-based Ameriflux data and instantiate them in the MSR-based portal via SharePoint mechanisms. In the second phase, we expand the portal's security to comprehensively cover the new features of the MSR-based Ameriflux portal (Grid access, CasJobs access, Microsoft Compute Cluster Edition access, etc.). Our output will be tools and guidance aimed at e-scientists who lack a comprehensive computer-security background.
http://zircote.forestry.oregonstate.edu/terra/people/bevlaw.htm, firstname.lastname@example.org, Oregon State
Database Design for Ecological Data Collected at Multiple Spatial and Temporal Scales
The TERRA-PNW research group (http://wwwdata.forestry.oregonstate.edu/terra/) has been conducting outstanding research to quantify and understand the response of terrestrial ecosystems to natural and human-induced changes for more than 10 years. TERRA-PNW consists of teams focused on various scales of analysis, from leaf photosynthesis to regional analysis of carbon and water balance. The biological, remote-sensing, and meteorological datasets span multiple temporal and spatial scales, with time steps ranging from 10^-1 to 10^7 seconds and spatial domains ranging from 10^-2 to 10^6 meters. Due to the variety of data and the number of people involved, there is a multitude of data formats, storage locations, and even storage media.
Lacking a consistent data-management strategy, these valuable data are not readily available for analysis. Research is shifting from isolated short-term studies to integrated syntheses, requiring easy access to various datasets over long periods. Funding agencies require community access to the datasets, and our data are in demand by climate-change research programs.
To produce a simple, workable approach to data archiving and access for scientific groups such as TERRA-PNW, this project aims to make multi-scale data accessible on an enterprise RDBMS. A Microsoft SQL Server 2005 system will store the data, organized in database clusters that can be queried and updated systematically. A web interface built with ASP pages will share the data with other researchers. This activity will be important for managing the rapidly increasing amount of data, and will serve as a prototype for national and international research networks.
The complex mechanisms that underlie cellular functions are often encoded in biological graph datasets such as pathways and protein-interaction networks. Methods for querying and mining these graphs have the potential to lead to fundamental advances in understanding basic cellular mechanisms, which in turn play a critical role in rational drug discovery. The biomedical community has created over 200 graph databases, and many of them are rapidly growing in size. To fully exploit the wealth of information in these databases, effective and efficient approximate (sub)graph querying tools are critical. The emphasis here is on approximate matching, as these graph datasets are noisy and incomplete. To address this need, we plan to investigate specifications of biologically relevant approximate graph-match models and to develop efficient index-based graph querying methods. We will apply our methods to biological pathway analysis and to the integration of pathway databases. In addition, a collaborator has an NLP pipeline that parses biomedical text, producing a graph for each paper (nodes represent biological entities; a link denotes that two entities were mentioned together in the same sentence). We will also develop graph-matching methods for such datasets to enable sophisticated searching of the biomedical literature that goes well beyond the current keyword-search paradigm. We have made some initial progress on these issues, producing a simple (but not highly scalable) algorithm. For a live demo see http://www.eecs.umich.edu/saga.
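To make the approximate (sub)graph matching problem concrete, here is a deliberately naive Python sketch (not the SAGA algorithm): it scores every injective mapping of query nodes onto data nodes by the number of query edges preserved, so a near-miss still surfaces as the best match even when no exact embedding exists. The graphs are fabricated.

```python
from itertools import permutations

def best_match(query_edges, query_nodes, data_edges, data_nodes):
    """Brute-force approximate subgraph match: try every injective mapping
    of query nodes onto data nodes and score it by the number of query
    edges it preserves.  Index-based methods avoid this exhaustive search;
    this sketch only defines the optimization problem they solve."""
    data_edge_set = {frozenset(e) for e in data_edges}
    best_score, best_map = -1, None
    for perm in permutations(data_nodes, len(query_nodes)):
        mapping = dict(zip(query_nodes, perm))
        score = sum(1 for u, v in query_edges
                    if frozenset((mapping[u], mapping[v])) in data_edge_set)
        if score > best_score:
            best_score, best_map = score, mapping
    return best_map, best_score

# Query triangle a-b-c against a path graph 1-2-3-4, which has no triangle:
# the best mapping preserves 2 of the 3 query edges (an approximate match).
mapping, score = best_match(
    query_edges=[("a", "b"), ("b", "c"), ("a", "c")],
    query_nodes=["a", "b", "c"],
    data_edges=[(1, 2), (2, 3), (3, 4)],
    data_nodes=[1, 2, 3, 4],
)
print(score)  # 2
```

Exact subgraph matching would reject this query outright; tolerating missing edges is what makes the approach usable on noisy, incomplete pathway and interaction data.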
FY2005 Grant: Declarative and Efficient Methods for Biological Data Management
The focus of our efforts in 2005-2006 was on developing declarative and efficient query-processing tools for biological sequence databases. In the last year, we developed an efficient suffix-tree-based tool for extracting frequently occurring sequence patterns (called motifs) from biological datasets. Such motif-finding methods play a critical role in discovering new transcription factors and new protein folds. This new motif finder is guaranteed not to miss any matches and can scale to large datasets beyond the reach of existing motif-finding methods. We also expanded our declarative sequence-querying tool, extending it to allow efficient cross-genome analysis – we can now scan the entire mouse and human genomes for specific transcription-factor motifs and produce putative gene targets in about a minute. In addition, we explored methods for protein-structure classification. As new protein structures are discovered, the structure information is deposited in public databases such as the PDB. However, these new structures must wait 6-12 months for classification into protein fold families, as the classification process requires manual assignment. We have designed a new tool called proCC for automatic protein classification, with a classification accuracy of 90%. Consequently, using proCC, new structure deposits can be classified into fold families immediately, with reasonable accuracy, while awaiting a more thorough manual assignment. Our research has produced publications in BMC Bioinformatics (to appear), ICDE 2006, and the VLDB Journal 2005. For more details please see http://www.eecs.umich.edu/periscope/.
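The motif-extraction task can be illustrated with a naive k-mer counter; the suffix-tree tool described above computes the same answer far more efficiently, without materializing every window. The sequences below are fabricated.

```python
from collections import defaultdict

def frequent_motifs(sequences, k, min_count):
    """Exhaustively count every length-k substring (k-mer) across the input
    sequences and keep those occurring at least min_count times.  This naive
    scan defines what a motif finder must report; a suffix tree reaches the
    same answer while sharing work across overlapping windows."""
    counts = defaultdict(int)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return {m: c for m, c in counts.items() if c >= min_count}

# The TATA-box-like pattern appears once in each fabricated sequence.
seqs = ["TATAAAGGC", "CCTATAAAT", "GGGTATAAA"]
print(frequent_motifs(seqs, 6, 3))  # {'TATAAA': 3}
```

The naive scan is O(total sequence length × k) per query and reconsiders each window independently, which is why suffix-tree indexing is needed to reach genome scale.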
Serious Games for Nursing Education – Prototype and Evaluation
The “Serious Games” concept has received increased attention lately. The idea is to use game metaphors and technology to support education and training, leveraging rich visuals, immersive situations, and interactive challenges to make learning more effective. Such approaches are being widely prototyped for the military, health care, emergency management, etc., even though their efficacy has yet to be scientifically proven. We propose a multidisciplinary project to develop a prototype serious game for education in nursing, specifically pediatrics, where such applications are rare. To succeed in this area, one has to integrate the challenging, life-like patient experiences found in nursing education with engaging game design, realistic art and graphics, and the technological and scientific capability of computer science. We have assembled a multidisciplinary team of scientists at SFSU including the School of Nursing, Computer Science, and Design and Industry. Our proposed work will include: a) analysis of the specific application and its applicability to games (i.e., pediatric education); b) design of the game concept, its flow, character behavior, graphics, etc.; c) development of a game prototype; and d) formal testing in the School of Nursing to determine its educational efficacy. We will use Microsoft software and hardware for development of this project. The money will be used for faculty release time and graduate students. The SFSU Center for Computing for Life Sciences will support this project with space, IT support and equipment, and $2K matching funds. Requested funding: $20-35K. Duration: 9-12 months.
An Early Warning System for Infectious Diseases
E-science techniques can be used to understand the source and spread of disease epidemics and to contain future outbreaks, thereby reducing the potentially massive toll on human life in underdeveloped nations. Even though epidemiological information is available for many pathogenic microbes, incidence reports are scattered and difficult to summarize. We propose a system to automatically extract, classify, and organize incidence reports by geographic location and type for analysis by domain experts. Documents from the U.S. National Library of Medicine (http://www.pubmed.gov) and the World Health Organization (http://www.who.int) will be tagged according to their spatial and temporal relationships to specific disease occurrences, and presented graphically via a Microsoft Virtual Earth interface. We leverage our experience with the SAND Spatial Browser and Spreadsheet to provide spatial and textual search capabilities on the web (e.g., documents on "influenza" near "Hong Kong"), possibly in conjunction with SQL Server. The spatial component of the search is facilitated by sorting entities with respect to the space they occupy. Users can also see the phrases in the documents that satisfy the query, facilitating easy verification as well as dismissal of false positives caused by errors in identifying geographical references, which are difficult to avoid. Tools will also be provided to restrict the search result to a particular time period. In addition, newspaper articles will be tagged and indexed to bolster surveillance of ongoing epidemics, while examining past epidemics with our system will improve understanding of the sources and spread of infectious diseases.
Progress 2005: Scalable Location-Based Services on Spatial Networks at Your Fingertips
I completed and published a book, representing 12 years of effort, on multidimensional (including spatial) and metric data structures, which includes our results on building a scalable framework for location-based services. In the past year, we developed several algorithms for nearest-neighbor queries in a spatial network, including an incremental nearest-neighbor algorithm and several distance-join operations. Moreover, we formulated a new framework for query processing in a spatial network that precomputes and compactly encodes the shortest paths and distances between every pair of vertices. By making suitable assumptions about the general nature of spatial networks, we showed that for some spatial networks the shortest path between every pair of vertices can be encoded in O(n) space, while a shortest path can be retrieved in O(k log n) average time, where k is the length of the shortest path. We also showed how to further reduce the average retrieval time to O(k log log n), at the cost of increasing the storage requirement to O(n log n). We incorporated our framework into the SAND database system and developed a tool to visualize several spatial queries on spatial networks. We are currently preparing a journal submission on this work. In addition, we applied the non-spatial-network variant of our incremental nearest-neighbor algorithm to point-cloud models to compute the k nearest neighbors in a dataset, which we showed to be I/O-optimal.
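One way to see how shortest-path retrieval can cost time proportional to the path length k is next-hop encoding: precompute, for each (source, destination) pair, the first vertex on a shortest path, then walk the table. The sketch below uses a plain O(n^2) table on a fabricated graph; the compact O(n)-space encoding described above additionally exploits the spatial coherence of road-like networks, which this illustration does not attempt.

```python
import heapq

def next_hop_table(graph):
    """For every source vertex, run Dijkstra and record the first edge taken
    on a shortest path to each destination.  graph: {u: {v: weight}}."""
    table = {}
    for src in graph:
        dist = {src: 0.0}
        first = {}          # destination -> first hop out of src
        pq = [(0.0, src, None)]
        while pq:
            d, u, hop = heapq.heappop(pq)
            if d > dist.get(u, float("inf")):
                continue    # stale heap entry
            for v, w in graph[u].items():
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v] = nd
                    first[v] = hop if hop is not None else v
                    heapq.heappush(pq, (nd, v, first[v]))
        table[src] = first
    return table

def shortest_path(table, src, dst):
    """Retrieve a shortest path by repeated next-hop lookups:
    O(k) table probes for a path of k edges."""
    path = [src]
    while src != dst:
        src = table[src][dst]
        path.append(src)
    return path

# Fabricated 4-vertex chain a-b-c-d with unit edge weights.
g = {"a": {"b": 1}, "b": {"a": 1, "c": 1},
     "c": {"b": 1, "d": 1}, "d": {"c": 1}}
t = next_hop_table(g)
print(shortest_path(t, "a", "d"))  # ['a', 'b', 'c', 'd']
```

The full table costs O(n^2) entries; the result quoted above compresses it to O(n) by grouping destinations whose shortest paths leave a vertex through the same edge, at the price of a log-factor lookup.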
Discovery and Analysis of User Information Goals and Usage in Media-Rich Scientific Websites
Web-based repositories such as the SkyServer have come to play a critical role in the dissemination of scientific information that is typically complex, heterogeneous, and multi-modal. Consequently, understanding the diverse user behaviors such websites sustain is a complex challenge. Such insight is significant, however, since it can help identify sub-optimal aspects of web-repository design, illuminate user behavior, and improve usability.
Current techniques for usage analysis are based either on the analysis of web logs or on the analysis of page content. Each has its limitations: usage-log analysis by itself is often insufficient to infer users' information goals or the extent to which they were satisfied, while current content-based techniques work only where the information is textual and browsing is supported through static, text-based hyperlinks.
We propose a new approach to usage analysis that combines web-log analysis with techniques that analyze the total information content of multimedia web pages. Our approach will allow patterns from usage history to be correlated with the information goals the underlying data can support. We will also develop techniques to analyze and predict user flow over textual and non-textual hyperlinks as well as other interface modalities. Experimental validation will be conducted on the SkyServer. The long-term impact of this research is envisaged to extend beyond scientific repositories like the SkyServer to the analysis of generic websites.
Implementing a Legacy Document Archive for the Sloan Digital Sky Survey
Over the last 8 years I have been working on the archive for the Sloan Digital Sky Survey. The main part of the survey ended in July 2005, and we are now running a 3-year extension. Soon the survey will end, and we have realized that very little technical documentation exists beyond the archive of the email exploders, which contains over 100K emails spanning the 12 years during which the survey was designed and operated. Most technical decisions were discussed on these email exploders. I have copied a tar-ball of the whole mail archive to JHU. I propose to convert the collection of emails into a text-searchable database, with intelligent markup and hyperlinks inserted into the text. The net result would be a set of intelligent annotations, linked to the underlying database, that can be archived and curated as an integral part of the data. Many other large collaborative projects operate this way: beyond the initial large proposal required for funding, most other documents are the email exchanged within the collaboration. I will also collect as many of the remaining technical drawings as possible (in electronic form) and insert them into the archive as well.
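A minimal sketch of the email-to-searchable-database idea, using Python's standard library and SQLite full-text search in place of whatever production database the project would choose; the two messages are fabricated stand-ins for the mail-exploder archive, and the markup/hyperlink pass is omitted.

```python
import sqlite3
from email.message import EmailMessage

def build_archive(messages):
    """Load email messages into an in-memory SQLite full-text index.
    (The real archive would walk the mail-exploder tar-ball instead.)"""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE VIRTUAL TABLE emails USING fts5(subject, sender, body)")
    db.executemany(
        "INSERT INTO emails VALUES (?, ?, ?)",
        [(m["Subject"], m["From"], m.get_content()) for m in messages],
    )
    return db

def search(db, query):
    """Full-text search; returns matching subjects, best matches first."""
    return [row[0] for row in db.execute(
        "SELECT subject FROM emails WHERE emails MATCH ? ORDER BY rank",
        (query,))]

# Two fabricated messages standing in for the 100K-message archive.
msgs = []
for subj, body in [("photometric calibration", "zero points drifted last night"),
                   ("telescope schedule", "observing run moved to Tuesday")]:
    m = EmailMessage()
    m["Subject"], m["From"] = subj, "someone@example.org"
    m.set_content(body)
    msgs.append(m)

db = build_archive(msgs)
print(search(db, "drifted"))  # ['photometric calibration']
```

The interesting curation work described above happens after this step: recognizing object names, document references, and people in the message bodies and turning them into hyperlinks into the survey database.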
2006 progress: Implementing a Scalable Scan-Machine Architecture for the SkyQuery Federation
Last year's grant was for implementing a parallel architecture for SkyQuery, handling fuzzy spatial joins across geographically separate databases, each containing ~10^8 objects. We had to drastically redesign the existing system and convert the underlying algorithms to zone-based implementations. The work was led by Maria Nieto-Santisteban, who has built a 10-way parallel system. We also had two visiting graduate students from the Technische Universität München (Benjamin Gufler and Tobias Scholl), who helped with the performance testing. The system is now being deployed over 10 servers. In the next month we will integrate it with the CasJobs Workbench as a front end, then release it for public use.
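The zone idea behind the redesign can be sketched in a few lines: bucket objects by declination into zones one match-radius tall, so each object is compared only against candidates in its own and adjacent zones rather than the whole catalog. The coordinates below are fabricated, and plain Euclidean distance stands in for the proper spherical metric (with RA wrap-around) that a real cross-match needs.

```python
from collections import defaultdict

def zone_crossmatch(cat1, cat2, radius):
    """Zone-based fuzzy spatial join: bucket cat2 into declination zones of
    height `radius` degrees, then compare each cat1 object only against
    cat2 objects in its own zone and the two neighboring zones.
    Points are (ra, dec) tuples in degrees."""
    zones = defaultdict(list)
    for ra, dec in cat2:
        zones[int(dec // radius)].append((ra, dec))
    matches = []
    for ra1, dec1 in cat1:
        z = int(dec1 // radius)
        for zz in (z - 1, z, z + 1):        # own zone plus both neighbors
            for ra2, dec2 in zones[zz]:
                if (ra1 - ra2) ** 2 + (dec1 - dec2) ** 2 <= radius ** 2:
                    matches.append(((ra1, dec1), (ra2, dec2)))
    return matches

# Fabricated catalogs: one close pair near (10, 20), plus two loners.
c1 = [(10.000, 20.000), (180.0, -45.0)]
c2 = [(10.0001, 20.0001), (90.0, 10.0)]
print(len(zone_crossmatch(c1, c2, 0.001)))  # 1
```

Because zone membership is just integer arithmetic on declination, the bucketing maps naturally onto a clustered index and a relational join, which is what makes the approach attractive inside a database and easy to partition across servers.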
Managing and Mining Protein Folding Data
Understanding the mechanism of protein folding remains a grand challenge in structural biology. To address this challenge and to complement wet-lab experiments, computer simulations based on molecular dynamics are employed to study the folding process at the atomistic level, often with femtosecond resolution. This has led to the accumulation of a vast amount of folding data, consisting of a myriad of folding trajectories from various proteins. (A folding trajectory is a series of 3D structures along the folding pathway.) As a result, effectively managing and mining such data is becoming increasingly important. We propose to address this need as follows. First, we will build trajectory-oriented indexing schemes. Such indexes will be structure-based and will facilitate efficient access to single and multiple trajectories. Second, we will design scalable mining algorithms (1) to detect and predict folding events such as nucleation, and (2) to identify common folding characteristics (e.g., a common folding pathway) across trajectories of individual or different proteins. These algorithms will be cognizant of domain knowledge from experiments and the literature. We currently have ~20GB of simulation data acquired from two biology groups. The PI will work closely with these groups to investigate the proposed approaches and verify the results. She also has full access to the computing facility at the SFSU Center for Computing for Life Sciences (1.2 TB storage). The funding will be used mainly for the PI's teaching-load release and to support student research associates. We are requesting $35K to support the first stage of this project.
Parts, Image, and Sketch based 3D Modeling Method for Domain Experts
Biologists use structural similarities and dissimilarities in the anatomical features of a species to guide classification within the Linnaean system of species identification. 3D models allow an enhanced representation of an object's structure in many fields, including biology, where they can aid researchers in species specification and help educate students in Linnaean classification. They communicate the principles of homology, homoplasy, parallelism, and convergence that guide species specification and identification more effectively than static 2D images.
One factor inhibiting more extensive use of 3D models in Biology is the relatively complex and time-consuming process associated with generating these models. Biologists are not expert computer users, and typical 3D modeling software requires significant expertise.
To enable simple, rapid construction of 3D models, we have developed a prototype 3D modeling tool that allows image-guided morphing of 3D geometries by end users. The system is currently stable enough to allow experimentation by researchers and students. The purpose of this grant is two-fold. First, we aim to refine the system, making it fully useful for biologists. Second, we aim to study in detail the use of 3D modeling in both education and research, by integrating it into the biology curriculum at San Francisco State University and into a laboratory environment at the Pacific Ecoinformatics and Computational Ecology Lab. Support from this grant will directly enhance the educational experiences of students and the ability of researchers to study species and the evolutionary characteristics that guided their development.
Jun Murakawa, Tracie Hong, Ilmi Yoon, and Edward Lank, "Parts, Image, and Sketch based 3D Modeling Method," Proceedings of Sketch-Based Interfaces and Modeling (SBIM), 2006.