The following is an assignment on Ontology and Linked Data and the evolution of data that led to the use of ontologies on the Web. Thanks to my teammates Bharath Gowda and Sravani Thota for working on the same.
Ontology And Linked Data
An ontology formally represents knowledge as a set of concepts within a domain, and the relationships between those concepts. It can be used to reason about the entities within that domain, and may be used to describe the domain.
In theory, an ontology is a “formal, explicit specification of a shared conceptualization”. An ontology renders shared vocabulary and taxonomy, which models a domain – that is, the definition of objects and/or concepts, and their properties and relations.
Ontologies are the structural frameworks for organizing information and are used in artificial intelligence, the Semantic Web, systems engineering, software engineering, biomedical informatics, library science, enterprise bookmarking, and information architecture as a form of knowledge representation about the world or some part of it. The creation of domain ontologies is also fundamental to the definition and use of an enterprise architecture framework. The method of publishing structured data so that it can be interlinked and become more useful is known as Linked Data.
Linked Data builds upon standard Web technologies such as HTTP and URIs, but rather than using them to serve web pages for human readers, it extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried.
Examples of Linked Data include DBpedia and the New York Times Linked Data. They extract structured information from Wikipedia and The New York Times respectively, allow users to ask sophisticated queries against that information, and link it to other data sets on the Web.
The story of the Internet starts in the 1950s with the development of computers. It began with point-to-point communication between mainframe computers and terminals, and expanded to point-to-point connections between computers. Packet-switched networks such as ARPANET and Telenet were developed starting in the late 1960s, which led to the development of TCP/IP.
As the Internet grew through the 1980s and early 1990s, many people realized the increasing need to be able to find and organize files and information. Projects such as Gopher, WAIS, and the FTP Archive list attempted to create ways to organize distributed data. Unfortunately, these projects fell short in being able to accommodate all the existing data types and in being able to grow without bottlenecks. One of the most promising user interface paradigms during this period was hypertext. Gopher became the first commonly-used hypertext interface to the Internet. While Gopher menu items were examples of hypertext, they were not commonly perceived in that way.
The need to present data in a more illustrative way led to the development of web browsers and HTML, and thus began the first Web as we know it, or as we call it now, Web 1.0. The Web consisted of the information that had been in FTP archives, presented more accessibly. The data was very limited and consisted mostly of research papers and news articles. It was stored as HTML. There was no need for a database or other storage mechanism, as the data was fairly static and very limited.
Users of the Web needed a way to interact with websites and to share more content and views on articles. This led the Web to evolve to focus more on community, sharing of user-generated content, and interchange of data, which caused an explosion of information on the Web. The shift from Web 1.0 to Web 2.0 is a direct result of the change in the behavior of those who use the World Wide Web. Web 1.0 trends included worries over privacy, resulting in a one-way flow of information through websites containing “read-only” material. Now, with Web 2.0, use of the Web can be characterized as the decentralization of website content, which is generated from the “bottom-up”, with many users being contributors and producers of information as well as traditional consumers. Web 2.0 websites allow users to do more than just retrieve information. By extending what was already possible in “Web 1.0”, they provide the user with more user-interface, software, and storage facilities. This has been called “network as platform” computing. Users can provide the data that is on a Web 2.0 site and exercise some control over that data. These sites may have an “architecture of participation” that encourages users to add value to the application as they use it.
The need for dynamic data on webpages led to the use of databases. Webpages used databases to store dynamic content and user interactions: articles, along with their tags and user comments, are stored in a database to enable dynamic updates of the pages. Blogspot and WordPress are examples of this type of webpage.
Some webpages focused on providing more information online, which led to the evolution of web databases. Web databases are online databases that can be queried using forms. They usually contain information that can be searched using specific keywords. Examples include Yahoo! Autos and Google Base. They mostly contain huge collections of data that can be queried and retrieved by users.
As more and more users, companies, and governments started making their data public through the Web, we started to consider the Web itself as a database. Because the data is not centralized, each source holds only partial data on a topic, so related information needs to be integrated or brought together. Attempts were therefore made to link the available information, and other webpages can use this linking to provide more information on the same topic. This led to the idea of linked data and the concept of ontology.
Web 1.0 consisted mostly of webpages linked to each other using hyperlinks. Web 1.0 was a read-only platform of information. It consisted of static, non-interactive web pages that at most allowed for an interchange of documents. In 1996, there were about 45 million global users. The focus of Web 1.0 was on companies owning content. The data was mostly stored in traditional databases whose content was inserted and edited only by the webpage owners. The data was limited to the view of the webpage owner, was usually very minimal, and was used only for publishing.
Web 2.0 features include containing data; linking to multiple areas such as books, reviews, users, and catalogs; services for individual recommendations; support for tagging alongside traditional subject headings; and support for the needs of the community. The characteristics of Web 2.0 are: rich user experience, user participation, dynamic content, metadata, web standards, and scalability. Further characteristics, such as openness, freedom, and collective intelligence by way of user participation, can also be viewed as essential attributes of Web 2.0.
Wikipedia is a fine example of a Web 2.0 website. Wikipedia is a multilingual, web-based, free-content encyclopedia project based on an openly editable model. It is written collaboratively by largely anonymous Internet volunteers who write without pay; anyone with Internet access can write and make changes to Wikipedia articles. It thus uses the power of every user of the site rather than a single employee or group of employees. Hence the information is more precise, of better quality, and more up to date than the one-sided version a single employee could produce.
An important part of Web 2.0 is the social Web, which represents a fundamental shift in the way people communicate. The social Web consists of a number of online tools and platforms where people share their perspectives, opinions, thoughts, and experiences. Web 2.0 applications tend to interact much more with the end user. As such, the end user is not only a user of the application but also a participant by:
- Contributing to RSS
- Social bookmarking
- Social networking
One of the issues with Web 2.0 is that information is represented as free text, and it is difficult to seek out specific information in text. Thus, the Web evolved into Web 3.0, or the Semantic Web.
Ontologies were developed to solve many issues that exist today. Sharing information and common knowledge is one of the key goals of ontology development. For example, there are lots of map applications available today. If there is a new landmark that needs to be identified and labeled, then the user has to update the landmark in every map application separately. If these websites share and publish the same underlying ontology of the terms they all use, then computer agents can extract and aggregate information from the different sites and label the landmark in all the applications. Separating domain knowledge from operational knowledge is another use of ontologies. We can describe the emergency recovery procedure during engine failure according to a specification and implement a program that does this independently of the type of engine. We can then develop an ontology for automobile engines and apply the recovery algorithm to it. We can also use the same algorithm for a power generator by feeding it a power-generator component ontology.
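The separation of operational knowledge from domain knowledge can be sketched in code. This is a minimal illustration in Python, not a real ontology system: each component ontology is just a dependency map, and all part names are hypothetical.

```python
# Two hypothetical component ontologies: each part maps to the parts it
# depends on. The part names are made up for illustration.
engine_ontology = {
    "engine": ["fuel_pump", "ignition"],
    "fuel_pump": [],
    "ignition": ["battery"],
    "battery": [],
}

generator_ontology = {
    "generator": ["rotor", "voltage_regulator"],
    "rotor": [],
    "voltage_regulator": ["control_board"],
    "control_board": [],
}

def recovery_plan(ontology, failed_part):
    """Operational knowledge: inspect a failed part's dependencies before
    the part itself, independent of which domain ontology is supplied."""
    plan, seen = [], set()

    def visit(part):
        if part in seen:
            return
        seen.add(part)
        for dep in ontology.get(part, []):
            visit(dep)          # dependencies first (post-order)
        plan.append(f"inspect {part}")

    visit(failed_part)
    return plan

# The same algorithm works for both domains:
print(recovery_plan(engine_ontology, "engine"))
print(recovery_plan(generator_ontology, "generator"))
```

The recovery logic never mentions engines or generators; swapping the ontology swaps the domain.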
An ontology consists of different components, of which the important ones include:
Classes, which describe the concepts in a domain, are the focus of most ontologies. Classes are also referred to as types, sorts, universals, and kinds. A class represents a group of individuals that share common characteristics. For example, both 2-wheel and 4-wheel vehicles share common characteristics: they are used for travelling, they use engines to function, and so on. 4-wheel vehicles can again be classified into cars and trucks. A class can have subclasses that are more specific than the superclass, and classes can also share relationships with each other.
Individuals, also known as instances, are the basic units of an ontology. They are the things that the ontology describes; individuals can be persons, automobiles, or machines, and they can also model abstract objects such as a person's job or a function.
Relations (or relationships) in an ontology describe the way in which individuals are related. Relations are normally expressed between individuals, or also between classes. The power of an ontology comes from its ability to describe relations. Important types of relations are is-a-superclass-of and is-a-subclass-of. From the example above, we can say that car is a subclass of 4-wheel, and 4-wheel itself is a subclass of vehicle.
Other components of an ontology are attributes, restrictions, rules, axioms, and events.
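The vehicle example above can be sketched with plain Python structures rather than an ontology language; the class and instance names below are made up for illustration.

```python
# Subclass relations from the vehicle example: child class -> parent class.
subclass_of = {
    "car": "4-wheel",
    "truck": "4-wheel",
    "4-wheel": "vehicle",
    "2-wheel": "vehicle",
}

# An individual (instance) and the class it belongs to.
instance_of = {
    "my_honda_civic": "car",
}

def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = subclass_of.get(cls)   # walk up the class hierarchy
    return False

print(is_a("car", "vehicle"))                          # True
print(is_a(instance_of["my_honda_civic"], "vehicle"))  # True
```

The is-a-subclass-of relation is transitive, which is what lets a reasoner conclude that a car is a vehicle even though that fact is never stated directly.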
Ontology engineering is the set of activities concerned with the ontology development process and lifecycle, the methods and methodologies for building ontologies, and the tool kits and languages that support them. An ontology reflects the structure of the world; it is often about the structure of concepts, and the actual physical representation is not an issue. An object-oriented class structure, in contrast, reflects the structure of data and code. It is usually about behavior (methods), and an OO class structure describes the physical representation of data (long int, char, etc.).
Domain ontologies are descriptions of particular subjects or domain areas. They are the “world views” by which organizations, communities, or enterprises describe the concepts in their domain, the relationships between those concepts, and the instances or individuals that are the actual things that populate that structure. Thus, domain ontologies are the basic bread-and-butter descriptive structures for real-world applications of ontologies. They provide a common conceptual vocabulary to the members of a virtual community of users who need to share information in a particular domain, such as medicine, banking, or tourism. The identification and definition of the concepts that describe the domain knowledge requires a certain consensus. Generally, each member or sub-community holds some knowledge, has its own view of the domain, and describes it with its own vocabulary. Reaching a consensus that reflects a common view of the domain can therefore be a difficult task, and even harder when members are geographically dispersed. One widely used approach is to start from pre-existing elements in the domain (text corpora, taxonomies, ontology fragments) and exploit them as a basis for gradually defining the domain ontology. For example, the word ‘mouse’ will be modeled one way in an Animals ontology but differently in a Computers ontology. As the ontology world expands, a need may arise to merge domain ontologies into a more general representation, since domain ontologies represent concepts in very specific and eclectic ways and are often incompatible. There can also be different ontologies in the same domain, due to different perceptions of the domain based on ideology or cultural background, or because they are represented in different languages.
The Gene Ontology (GO) is a collaborative effort to address the need for consistent descriptions of gene products across different databases. The project aims to maintain and develop its controlled vocabulary of genes and gene products, to annotate genes and gene products, and to provide tools for easy access to all aspects of the data the project provides. The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products, in a species-independent manner, in terms of their associated cellular components, biological processes, and molecular functions:
- The cellular component ontology describes locations, at the levels of subcellular structures and macromolecular complexes. Examples of cellular components include the nuclear inner membrane (with the synonym inner envelope) and the ubiquitin ligase complex, of which several subtypes are represented.
- A biological process is a series of events accomplished by one or more ordered assemblies of molecular functions. The biological process ontology includes terms that represent collections of processes as well as terms that represent a specific, entire process. Generally, the former will have mainly is_a children, and the latter will have part_of children that represent sub-processes.
- Molecular function describes activities, such as catalytic or binding activities, that occur at the molecular level. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions, and do not specify where, when, or in what context the action takes place. Molecular functions generally correspond to activities that can be performed by individual gene products, but some activities are performed by assembled complexes of gene products. Examples of broad functional terms are catalytic activity and transporter activity.
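The is_a and part_of relations mentioned above form a graph that can be traversed per relation type. The sketch below uses plain Python dictionaries; the edges are simplified for illustration and are not the exact GO graph.

```python
# Simplified GO-style term graph: each term lists its is_a and part_of
# parents. The edges here are illustrative, not taken from the real GO.
terms = {
    "membrane": {"is_a": [], "part_of": []},
    "inner membrane": {"is_a": ["membrane"], "part_of": []},
    "nuclear envelope": {"is_a": [], "part_of": []},
    "nuclear inner membrane": {"is_a": ["inner membrane"],
                               "part_of": ["nuclear envelope"]},
}

def ancestors(term, relation):
    """All transitive parents of a term along one relation type."""
    result = set()
    for parent in terms.get(term, {}).get(relation, []):
        result.add(parent)
        result |= ancestors(parent, relation)
    return result

print(ancestors("nuclear inner membrane", "is_a"))
# {'inner membrane', 'membrane'}
```

Keeping the two relation types distinct matters: being a kind of membrane (is_a) and being a component of the nuclear envelope (part_of) license different inferences.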
XML is a metalanguage that allows users to define their own markup languages by defining data, data types, and syntax. It is a general-purpose markup language for documents containing structured information. An XML document contains elements that can be nested and that may have attributes and content. XML namespaces allow different markup vocabularies to be used in one XML document. An XML schema expresses the schema of a particular set of XML documents.
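A minimal sketch of an XML document that uses a namespace to mix in a second vocabulary, parsed with Python's standard library. The element names are illustrative (the dc: prefix follows the Dublin Core convention).

```python
import xml.etree.ElementTree as ET

# A tiny document mixing a default vocabulary (book) with a namespaced
# one (dc:); the content is made up for illustration.
doc = """<book xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Linked Data</dc:title>
  <dc:creator>Example Author</dc:creator>
</book>"""

root = ET.fromstring(doc)

# ElementTree expands prefixes into {namespace-URI}localname form,
# so lookups use the full namespace URI rather than the dc: prefix.
title = root.find("{http://purl.org/dc/elements/1.1/}title")
print(title.text)  # Linked Data
```

The namespace URI, not the prefix, identifies the vocabulary: another document could bind the same vocabulary to a different prefix and still mean the same thing.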
XML describes syntax but does not provide semantics, so it must be extended for the semantics of the Web. RDF (the Resource Description Framework) is a framework for representing information about resources in graph form. It was primarily intended for representing metadata about WWW resources, such as the title, author, and modification date of a Web page, but it can be used for storing any other data. It is based on subject-predicate-object triples that form a graph of data, and all data in the Semantic Web use RDF as the primary representation language. An RDF document consists of a collection of statements, commonly serialized in XML format; the root node of such an XML document is rdf:RDF, followed by a list of elements, each of which corresponds to a statement.
Information is represented by triples subject-predicate-object in RDF.
- Subject: the resource being described; in RDF, anything that can be named via a URI can be a subject.
- Predicate: a property of the resource; a property is itself a resource that has a name, for instance Author or Title.
- Object: the value associated with the property; a statement consists of the combination of a resource, a property, and a value.
A combination of the three is called a statement (or triple). RDF itself serves as a description of a graph formed by triples. Anyone can define a vocabulary of terms for more detailed description. To allow standardized description of taxonomies and other ontological constructs, RDF Schema (RDFS) was created, together with its formal semantics, within RDF. RDFS can be used to describe taxonomies of classes and properties and to build lightweight ontologies from them.
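The triple model above can be sketched with statements as plain Python tuples; the URIs below are illustrative, not real resources.

```python
# RDF-style statements as (subject, predicate, object) tuples.
# All URIs here are made up for illustration.
triples = [
    ("http://example.org/page1",
     "http://purl.org/dc/elements/1.1/title",
     "Ontology And Linked Data"),
    ("http://example.org/page1",
     "http://purl.org/dc/elements/1.1/creator",
     "http://example.org/people/author1"),
]

def objects(subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects("http://example.org/page1",
              "http://purl.org/dc/elements/1.1/title"))
# ['Ontology And Linked Data']
```

Note that an object can itself be a URI (as in the creator statement), which is what lets triples from different sources join into one graph.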
RDF Schema (RDFS) extends the RDF vocabulary to allow the description of taxonomies of classes and properties. It is a lightweight ontology language providing a basic set of resources, properties, and statements.
RDF Schema provides a higher level of abstraction than RDF.
- specific classes of resources,
- specific properties,
- and the relationships between these properties and other resources can be described.
All resources can be divided into groups called classes. Classes are themselves resources, so they are identified by URIs and can be described using properties. The members of a class are its instances, which is stated using the rdf:type property. Note that a class and its set of instances are not the same thing: the set of instances is the extension of the class, and two different classes may have the same set of instances. In RDFS a class may be an instance of a class. All resources are instances of the class rdfs:Resource. All classes are instances of rdfs:Class and subclasses of rdfs:Resource. All literals are instances of rdfs:Literal. All properties are instances of rdf:Property. Properties in RDFS are relations between subjects and objects in RDF triples, i.e., predicates. Every property has a defined domain and range.
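The rdf:type and rdfs:subClassOf machinery supports simple inference. Below is a minimal sketch, assuming made-up ex: names, of two core RDFS entailment rules (subclass transitivity and type propagation) applied to a set of triples until nothing new can be derived.

```python
# Abbreviated predicate names; real RDF uses full URIs.
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Illustrative facts (the ex: names are invented).
facts = {
    ("ex:car", SUBCLASS, "ex:4-wheel"),
    ("ex:4-wheel", SUBCLASS, "ex:vehicle"),
    ("ex:myCar", RDF_TYPE, "ex:car"),
}

def entail(triples):
    """Apply two RDFS rules to a fixpoint:
    (A subClassOf B), (B subClassOf C) => (A subClassOf C)
    (x type A),       (A subClassOf B) => (x type B)"""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in triples:
            for s2, p2, o2 in triples:
                if o == s2 and p2 == SUBCLASS:
                    if p == SUBCLASS:
                        new.add((s, SUBCLASS, o2))
                    elif p == RDF_TYPE:
                        new.add((s, RDF_TYPE, o2))
        if not new <= triples:       # anything genuinely new?
            triples |= new
            changed = True
    return triples

inferred = entail(facts)
print(("ex:myCar", RDF_TYPE, "ex:vehicle") in inferred)  # True
```

The fact that myCar is a vehicle is never stated; it follows from the class hierarchy, which is exactly the kind of reasoning RDFS semantics licenses.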
OWL (the Web Ontology Language) is derived from description logics and offers more constructs than RDFS. It is syntactically embedded into RDF, so, like RDFS, it provides additional standardized vocabulary. RDFS and OWL have formally defined semantics, which can be used for reasoning within ontologies and knowledge bases described in these languages. To provide rules beyond the constructs available in these languages, rule languages are being standardized for the Semantic Web as well; two emerging standards are RIF and SWRL. OWL comes in three species: OWL Lite for taxonomies and simple constraints, OWL DL for full description logic support, and OWL Full for maximum expressiveness and the syntactic freedom of RDF. Since OWL is based on description logic, it is not surprising that a formal semantics is defined for the language. OWL is divided into the following sublanguages:
- OWL Lite
- OWL DL (Description Logic)
- OWL Full
OWL Full and OWL DL support the same set of OWL language constructs. Their difference lies in restrictions on the use of some of those features and on the use of RDF features. OWL Full allows free mixing of OWL with RDF Schema and, like RDF Schema, does not enforce a strict separation of classes, properties, individuals, and data values. OWL DL puts constraints on mixing with RDF and requires disjointness of classes, properties, individuals, and data values. The main reason for having the OWL DL sublanguage is that tool builders have developed powerful reasoning systems that support ontologies constrained by the restrictions OWL DL requires. OWL Lite is a sublanguage of OWL DL that supports only a subset of the OWL language constructs. It is particularly targeted at tool builders who want to support OWL but want to start with a relatively simple basic set of language features. OWL Lite abides by the same semantic restrictions as OWL DL, allowing reasoning engines to guarantee certain desirable properties.
The Semantic Web is an extension of the current World Wide Web in which stored information carries well-defined meaning, enabling people and computers to work in cooperation. The Semantic Web is a web that is able to describe things in a way that computers can understand. Consider the lines below:
- California is the most populated state in the US.
- Sacramento is the capital of California.
- Silicon Valley is located in the bay area of California.
Sentences like the ones above can be understood by people. But how can they be understood by computers? Statements are built with syntax rules; the syntax of a language defines the rules for building its statements. But how can syntax become semantics? This is what the Semantic Web is all about: describing things in a way that computer applications can understand. The Semantic Web is not about links between web pages; it describes the relationships between things (like A is a part of B and Y is a member of Z) and the properties of things (like size, weight, age, and price).
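As a sketch, the three sentences above could be encoded as subject-predicate-object triples; the predicate and resource names here are invented for illustration.

```python
# The three example sentences as triples (names are illustrative).
triples = [
    ("California", "mostPopulatedStateOf", "US"),
    ("Sacramento", "capitalOf", "California"),
    ("Silicon Valley", "locatedIn", "Bay Area of California"),
]

# Once the sentences are triples, a machine can answer
# "what is the capital of California?" by pattern matching:
capital = next(s for s, p, o in triples
               if p == "capitalOf" and o == "California")
print(capital)  # Sacramento
```

The same query would have been hard against the free-text sentences; the triples make the relationships explicit and machine-readable.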
It is important to have the huge amount of data on the Web available in a standard format, reachable and manageable by Semantic Web tools. Furthermore, the Semantic Web needs not only access to data but also the relationships among data, to create a Web of Data (as opposed to a sheer collection of datasets). This collection of interrelated datasets on the Web is nothing but Linked Data.
The evolution from the current Web of “linked documents” (hyperlinks) to a Web of “linked data” is steadily gaining importance. The Semantic Web is often treated as a synonym for Web 3.0. Some claim that Web 3.0 will be more application-based and center its efforts on more graphically capable environments, “non-browser applications and non-computer based devices…geographic or location-based information retrieval”, and even wider use and growth of Artificial Intelligence. For example, Conrad Wolfram has argued that Web 3.0 is where “the computer is generating new information”, rather than humans.
Ontology Languages
Ontology languages are formal languages used to construct ontologies. They allow the encoding of knowledge about specific domains and often include reasoning rules that support the processing of that knowledge. Ontology languages are usually declarative languages, are almost always generalizations of frame languages, and are commonly based on either first-order logic or description logic.
Ontologies are used for classification, navigation, concept identification, querying, and reasoning. As SOA software has various kinds of structures, such as application architecture, collaborations, services, and workflows, it is advantageous to divide an ontology system into different kinds of ontology systems, one for each kind of software artifact: that is, software-oriented ontologies. Each ontology system focuses on one aspect of SOA software development, and the systems can cross-reference each other. For example, we can have the following ontology systems:
- Application ontology (AO): This defines concepts and relationships related to applications;
- Collaboration ontology (CO): This defines various collaboration templates with associated workflows and services, and the CO cross references to AO, WO and SO for easy service and workflow identification;
- Workflow ontology (WO): This defines concepts and relations of workflows. In WO, specific workflows from different domains are classified and relations are also specified facilitating collaboration;
- Service ontology (SO): This defines concepts and relationships of services.
Ontology-enabled SOA is a derivative of SOA that assumes a direct exchange of semantically rich messages between processing components. It can be combined with any SOA-compatible architecture (WSDL/SOAP) and with any ontology language. The style can be gradually refined into derivative styles to support additional service-related activities. An ontology language must be used to express the schema underlying messages; this constraint explicitly addresses the semantic interoperability of messages. In most modern applications of SOA, XML and XML Schema provide a unified syntax and a vocabulary-definition mechanism for messages. However, the XML Schema language addresses structural aspects only, leaving the semantics of a defined vocabulary implicit. This leaves the problem of semantic interoperability among processing components unsolved but, on the other hand, does not restrict the flexibility of defining domain-specific vocabularies. An ontology can provide a domain vocabulary whose semantics is precisely defined in terms of the ontological primitives of the underlying language.
Limitation
Ontology languages such as RDFS and OWL are designed to have fixed semantics and, therefore, cannot fulfill the extensibility requirement with respect to schema semantics. The user is limited to modeling an application domain in the terms provided by an ontology language. This means that the more restrictive an ontology language is, the less flexible an ontology-enabled SOA based on that language becomes and, therefore, the more it diverges from the general requirements of SOA. To achieve the level of flexibility required by SOA, the semantics of an ontology language must be extensible. However, at present, such languages are not available, and it is difficult to foresee whether they will appear in the future. If an extensible ontology language is unavailable, then the least restrictive one (e.g. RDFS) might be a good enough candidate.