Because I do a lot of work with natural language understanding and “semantic” technology, people frequently ask me how my work fits in with the Semantic Web. The answer is that there is little overlap: the semantic web is about data standards for structured documents, whereas I develop NLP and machine learning technology to take unstructured documents and turn them into structured documents. My work is “semantic” in a different (and arguably deeper) sense: it’s about teaching machines to parse and, in a rudimentary way, to understand natural language.

For anyone confused about what the semantic web really is, here is my take:

Motivation behind the semantic web

The World Wide Web is a universal format for putting human-readable documents on the internet. It has been a revolutionary step in human knowledge, allowing humans to share knowledge on any topic, and to perform all kinds of daily activities from shopping to travel reservations. In 2001, Tim Berners-Lee, the inventor of the World Wide Web, co-authored a hugely influential article which argued that the next major step in the evolution of the internet would be to develop common data standards so that machines can share information from multiple web sites in the way that humans can. Berners-Lee and his co-authors called this “the semantic web” (in fact, he had discussed these ideas as early as 1994). The idea has not quite taken off, but it has been slowly gaining currency in the tech community ever since (there seems to be a lot of discussion about it lately, though Google Trends suggests otherwise).

The motivation behind the semantic web is, basically, that machines are not very smart. When I make travel plans online for a meeting in Chicago, I combine information from many different sites. I’ll check my calendar at Google for the best days to go, look up flight schedules and fares at Expedia to find a cheap flight on those days, check the address of the meeting using Gmail, read hotel reviews at TripAdvisor to find the best accommodations near that address, check pricing and availability on those hotels using Hotels.com, and then book everything. It’s great that the internet allows me to do this, but, in fact, it’s a lot of work. I have to consult five different sites to complete the task, and make all sorts of decisions and inferences along the way.

It would save a tremendous amount of time if we could delegate complex scheduling tasks like this to a computer, but there are two major challenges. For one thing, the information on these different websites (e.g., calendar dates, flight schedules, hotel reviews, street addresses, etc.) is easy for a human to understand, but difficult for a computer. When I see the words “Park Hyatt Chicago” at TripAdvisor, I know that this is a hotel and that the user review next to it gives me information about the quality of the hotel. Moreover, when I see the same name at Hotels.com with “$779.00” next to it, I know that these two sites are referring to the same hotel. I can see this immediately, without even thinking about it. Most computer programs aren’t so gifted. When they scan this page at TripAdvisor, they see a lot of words. They don’t know what the words refer to (or even that “Park Hyatt Chicago” refers to one entity, not three). They don’t know what a user review is, or which words on the page are part of a user review rather than part of an adjacent advertisement. They may not even know that the TripAdvisor page and the Hotels.com page both refer to the same entity.

There is another challenge too. Even if we explicitly tell the computer that “Park Hyatt Chicago” refers to a hotel, that “5/5 stars” and “I enjoyed my stay” are part of a user review, and that $779.00 is the cost per night, the computer may not know how to reason with all of this information. It doesn’t know how to use the review and price information to make a good decision. It doesn’t even know that 5/5 stars is a very good thing, or that $779/night is far too expensive for most travelers. It doesn’t know how to combine the information from these two sites in order to infer the best value for your money.
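To make the point concrete, here is a minimal Python sketch of what that missing reasoning might look like once a programmer spells it out by hand. Everything in it is an assumption for illustration: the field names, the second hotel and its numbers, and the "stars per dollar" rule are invented, not taken from TripAdvisor or Hotels.com.

```python
# Hypothetical structured records from two sites (field names are assumptions).
reviews = [
    {"hotel": "Park Hyatt Chicago", "stars": 5.0},
    {"hotel": "Budget Inn Chicago", "stars": 3.5},   # invented example hotel
]
prices = [
    {"hotel": "Park Hyatt Chicago", "usd_per_night": 779.00},
    {"hotel": "Budget Inn Chicago", "usd_per_night": 120.00},  # invented price
]


def best_value(reviews, prices, max_price):
    """Join the two sources on hotel name and pick the best stars-per-dollar.

    The machine only "knows" that more stars is good and that a price can be
    too high because this function says so explicitly.
    """
    price_by_hotel = {p["hotel"]: p["usd_per_night"] for p in prices}
    candidates = []
    for review in reviews:
        price = price_by_hotel.get(review["hotel"])
        if price is None or price > max_price:
            continue  # no price found, or over budget
        candidates.append((review["stars"] / price, review["hotel"]))
    return max(candidates)[1] if candidates else None


print(best_value(reviews, prices, max_price=1000))
```

Note that the join works only because both records use exactly the same hotel name, which is the first challenge all over again.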

How do we overcome these problems? The ideal solution would be to build machines that are genuinely smart, machines that can actually read and understand natural language. Smart systems like this have been the driving goal of artificial intelligence researchers for decades (so-called “strong” AI). Powerful and suggestive AI techniques are continually being developed, but we haven’t arrived at this ambitious goal quite yet.

The semantic web is an alternative. Rather than programming computers to be smart interpreters of ambiguous data, we make the data clearer and easier to interpret. To help computers understand the information on a page, the semantic web incorporates standards so that publishers of a web site can carefully structure their information in a simple machine-readable format; entities like hotels, addresses, user reviews, and prices are all clearly identified. To help computers reason with this structured information, the semantic web incorporates standards so that publishers can explicitly tell computers what inferences they can make with the data.
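A rough Python sketch of both halves of this idea follows. It is not real semantic web syntax: the facts are plain subject/predicate/object tuples, and the identifiers (`ParkHyattChicago`, `type`, `subClassOf`, `pricePerNight`) are made up for illustration. The single inference rule is modeled loosely on the subclass rule in RDF Schema.

```python
# Facts a publisher might state explicitly, as (subject, predicate, object).
# All identifiers here are illustrative assumptions, not a real vocabulary.
triples = {
    ("ParkHyattChicago", "type", "Hotel"),
    ("ParkHyattChicago", "pricePerNight", 779.00),
    ("Hotel", "subClassOf", "Lodging"),
    ("Lodging", "subClassOf", "Place"),
}


def infer_types(triples):
    """Apply one rule until nothing new appears:
    if X has type C, and C is a subClassOf D, then X also has type D.
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(facts):
            if p != "type":
                continue
            for (c, p2, d) in list(facts):
                if p2 == "subClassOf" and c == o and (s, "type", d) not in facts:
                    facts.add((s, "type", d))
                    changed = True
    return facts


inferred = infer_types(triples)
print(("ParkHyattChicago", "type", "Place") in inferred)
```

The program never has to guess that a hotel is a kind of place; the publisher has told it so, and the rule licenses the conclusion.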

Challenges for the semantic web

The semantic web obviates the need to solve very difficult AI problems, but it has challenges of its own. First, for the potential of the semantic web to be realized, its demanding data standards must become widely adopted. So far this hasn’t happened, though there has been progress. Second, publishers can mark up data that is already well organized in databases (prices, product ids, dates, etc.), but a great deal of information on the internet is and always will be in natural language format (news and blog articles, product reviews, emails and messages, etc.). It’s not clear how to incorporate this data into the semantic web vision. To encode this data using proper semantic web standards, we would either have to do so by hand (which is extremely cumbersome) or write programs to parse and translate this natural language text into a suitable machine format (but if we had such a program, then we wouldn’t need the semantic web at all). Third, the semantic web represents an old, and arguably outdated, vision for how intelligent systems should reason. The semantic web standards are built around symbolic representations of entities, their relationships, and applicable rules for inference. Having this information is certainly better than nothing, but it’s not clear how much real-world activity and decision-making can be incorporated into this framework. Many computer scientists and philosophers have argued that symbolic approaches to AI and reasoning are severely limited (a common term of derision is GOFAI, or “Good Old-Fashioned Artificial Intelligence”).

Most likely, semantic web standards will prove more useful for sharing data than for sharing rules of reasoning and inference. Better organization of data can only be a good thing, but it’s important to see that the semantic web is mainly about smart data, not smart machines. If it does take off, it will only be a prelude to a web that is built with smarter machines (ones with more robust and flexible capabilities for learning, parsing text, and representing knowledge).

Alternatives

The semantic web has been slow in development, but that hasn’t stopped people from building programs that have “semantic web”-like functionality. One trend has been the proliferation of data APIs. APIs make structured data available for consumption by other machines, but they don’t typically conform to semantic web standards. It turns out to be quite difficult to foster agreement on any but the loosest data standards, such as XML (which is a data formatting standard, but not a data content standard). APIs make it possible to create mashups of data from many sites, but they don’t make it trivial: most APIs speak their own idiosyncratic language, and it takes work to connect a lot of different APIs together.
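The work of connecting APIs mostly looks like the Python sketch below: one small hand-written adapter per site, each translating that site's vocabulary into a shape the mashup can use. The two payload layouts and every field name are hypothetical; they are not the actual response formats of any real service.

```python
# Two hypothetical API responses describing the same hotel in different shapes.
site_a_payload = {
    "hotelName": "Park Hyatt Chicago",
    "rate": {"amount": "779.00", "currency": "USD"},
}
site_b_payload = {
    "title": "Park Hyatt Chicago",
    "price_cents": 77900,
}


def from_site_a(payload):
    """Adapter for the first (assumed) format: price is a decimal string."""
    return {
        "name": payload["hotelName"],
        "usd_per_night": float(payload["rate"]["amount"]),
    }


def from_site_b(payload):
    """Adapter for the second (assumed) format: price is an integer in cents."""
    return {
        "name": payload["title"],
        "usd_per_night": payload["price_cents"] / 100,
    }


# Only after both adapters run can the two records be compared directly.
print(from_site_a(site_a_payload) == from_site_b(site_b_payload))
```

Every new API added to the mashup needs another adapter like these, which is exactly the cost a shared content standard would remove.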

Another trend in this direction has been the development of programs that parse free text. The driving assumption behind the semantic web is that to enable computers to perform complex tasks on the internet, it will be easier to get publishers to agree on common data standards than to build machines that can parse unstructured data. It’s not clear that this is right. Oddly, for some types of data, it has been easier to build smarter machines. OpenCalais, now owned by Thomson Reuters, is a promising software service in this area. OpenCalais scans any piece of text and automatically extracts entities, facts, and events (including the entity “Park Hyatt Chicago”). It even marks them up using the semantic web standard RDF. In a way, OpenCalais acts like a translator between plain HTML and the shiny new world of the semantic web. And in a way, as this kind of smart NLP technology becomes more widespread, it seems to undermine the need for the semantic web itself.