An estimated 10 billion static web pages and 60 billion dynamic web pages now exist on the World Wide Web,
and the numbers keep growing. Web mining is the process of extracting and using the information in these
web pages. The basic types of web mining are web content mining, web structure mining, and web usage
mining. Our technology for web data mining and content scraping retrieves web content automatically.
The content on a web page is visible to public or private groups of users, and
the data is accessible over HTTP. In theory, any web page can
be mined. In practice, the structure of web pages varies drastically from site to site, which means
that their content is not readily machine-readable. In addition, some web pages
employ a series of measures that protect against web mining.
These can include hidden dynamic variables, dynamic cookies, session cookies, limited bandwidth,
login systems, and SSL/TLS secure communications. All of these obstacles make it
difficult to create a single tool that works for all web pages, especially for secure
and dynamic web pages. We understand the characteristics of web pages, and we
have designed and developed a web mining tool that overcomes a website's
protective features. Our web data mining tool is flexible, adaptable, scalable, and web server-friendly.
This allows us to scrape any content from any web page.
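As a rough illustration of what retrieving content from a page involves once the HTML has been fetched, the sketch below pulls the text and the link targets out of a document using only the Python standard library. It is not the tool described above: the class name `TextAndLinkExtractor`, the helper `extract`, and the sample markup are assumptions made for this example, and fetching, cookies, and logins are left out entirely.

```python
from html.parser import HTMLParser


class TextAndLinkExtractor(HTMLParser):
    """Collects text outside <script>/<style> and the href of every <a> tag."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style content.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (text, links) for one HTML document given as a string."""
    parser = TextAndLinkExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links


# Hypothetical page used only for this example.
SAMPLE_HTML = """
<html><head><title>Demo</title><style>p {color: red}</style></head>
<body><h1>Products</h1><p>Price: <b>42</b> USD</p>
<a href="/next">Next page</a><script>var x = 1;</script></body></html>
"""

text, links = extract(SAMPLE_HTML)
print(text)
print(links)
```

The extracted links can feed a crawl queue, while the text is the raw input for content mining. Pages that build their content with scripts would need a different approach than this parser-only sketch.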
Technically, web mining consists of Web usage mining, Web structure mining, and
Web content mining, each of which looks for patterns in a different kind of Web data.
Web usage mining refers to the discovery of user access patterns from Web usage logs:
it applies data mining to analyze and discover interesting patterns in usage data on the Web.
Graphs are typically used to determine frequent traversal patterns or large
reference sequences from the physical layout of a site, such as the most frequently visited
paths in a Web site.
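The idea of the most frequently visited paths can be sketched as follows. This is a minimal example, assuming the usage logs have already been grouped into per-visitor sessions (ordered lists of page URLs); the function name `frequent_paths`, its parameters, and the sample sessions are hypothetical.

```python
from collections import Counter


def frequent_paths(sessions, length=2, min_support=2):
    """Count contiguous page sequences of a given length across sessions.

    Each path is counted at most once per session, so a single long visit
    cannot dominate the result.
    """
    counts = Counter()
    for pages in sessions:
        seen = set()
        for i in range(len(pages) - length + 1):
            seen.add(tuple(pages[i:i + length]))
        counts.update(seen)
    return [(path, n) for path, n in counts.most_common() if n >= min_support]


# Hypothetical sessions, each an ordered list of visited pages.
SESSIONS = [
    ["/", "/products", "/products/1", "/cart"],
    ["/", "/products", "/about"],
    ["/blog", "/", "/products", "/products/1"],
]

PATHS = frequent_paths(SESSIONS)
for path, n in PATHS:
    print(" -> ".join(path), n)
```

Raising `length` looks for longer traversal patterns, and `min_support` sets how many sessions must contain a path before it is reported.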
Viewed in data mining terms, Web mining has
three operations of interest: clustering, association, and sequential
analysis. Technologies that rely on user profiling and usage
patterns include recommendation systems, Web analytics applications, application
servers coupled with content management systems, and fraud detectors. The
evolution of electronic reference services, from the reference desk to
cyberspace, strongly supports the ongoing research interest in
web mining tools. Further application areas include user profiling, usage analysis,
ontology extraction for the Semantic Web, and intelligent search and recommendation
systems based on user preferences, page content, and site semantics.
Most web content mining systems use wrappers to map documents to
other data structures, but this approach is highly dependent on the layout and
formatting instructions inside web pages. Web data mining technology therefore has to address
the semi-structured data source model, the semi-structured data model and its querying, and the
integration problem.
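To make the layout dependence of wrappers concrete, here is a small sketch of a wrapper that maps one specific table layout to a list of records. The markup, the pattern `ROW_PATTERN`, and the function `wrap_products` are invented for this example and do not come from any particular system.

```python
import re

# The pattern encodes one specific page layout: a table row with a
# "name" cell followed by a "price" cell.
ROW_PATTERN = re.compile(
    r'<tr>\s*<td class="name">(?P<name>[^<]+)</td>\s*'
    r'<td class="price">(?P<price>\d+(?:\.\d+)?)</td>\s*</tr>'
)


def wrap_products(html):
    """Map matching table rows to dictionaries with a name and a price."""
    return [
        {"name": m.group("name").strip(), "price": float(m.group("price"))}
        for m in ROW_PATTERN.finditer(html)
    ]


# Hypothetical page in the layout the wrapper was written for.
PAGE = """
<table>
<tr><td class="name">Widget</td><td class="price">9.50</td></tr>
<tr><td class="name">Gadget</td><td class="price">12.00</td></tr>
</table>
"""

# The same data after a small redesign that renames one CSS class.
REDESIGNED = PAGE.replace('<td class="price">', '<td class="cost">')

RECORDS = wrap_products(PAGE)
AFTER_REDESIGN = wrap_products(REDESIGNED)
print(RECORDS)
print(AFTER_REDESIGN)  # the layout changed, so the wrapper finds nothing
```

A single renamed class is enough to break the extraction, which is the fragility the paragraph above describes.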
The technology aims to bring together various perspectives on Web mining and to stress the
synergy between Web usage mining, Web content mining, Web
intelligence, and Semantic Web mining. Web mining also differs from conventional data mining because
Web data are mainly semi-structured or unstructured, while data mining deals primarily with
structured data. One possible approach is to personalize the Web space: create
a system that responds to user queries by aggregating information
from several sources in a manner that depends on who the user is. From server data it
is possible to determine such information as the number of accesses
to the server, the times or time intervals of visits, and the domain names
and URLs of users of the Web server.
The machine learning community has a long tradition of extracting structure
from large-scale data collections. Web mining likewise needs a technique for semi-structured
model extraction, that is, for automatically extracting a semi-structured
model from existing data. As these systems grow larger,
however, users feel the need for more structure to better organize their
resources.
Web content mining is the process of discovering useful information or knowledge from the
content of a web page. In the past few years, there
has been a rapid expansion of activity in the Web content mining area.
Web content mining is sometimes called web text mining, because text
content is the most widely researched area. A related task, which belongs to Web structure
mining, is mining the document structure. Web content mining is related to but
different from data mining and text mining. Moreover, the data sets in Web
mining are extremely large. Web mining is the application of data mining
techniques to discover patterns from the Web. Web usage mining is the application of data mining
to analyze and discover interesting patterns in users' usage data on the Web.
The information gathered through Web mining is evaluated using traditional
data mining techniques such as clustering and classification, association, and
examination of sequential patterns.
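Of the techniques listed above, association is the simplest to show in a few lines. The sketch below computes support and confidence for pairs of pages that occur in the same session; the function `association_rules`, its thresholds, and the sample sessions are assumptions for this example, and a real system would likely use a full algorithm such as Apriori rather than this pair-only version.

```python
from collections import Counter
from itertools import combinations


def association_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for page pairs.

    support    = share of sessions containing both pages
    confidence = share of sessions with the antecedent that also
                 contain the consequent
    """
    n = len(sessions)
    page_sets = [set(s) for s in sessions]
    single_counts = Counter(p for s in page_sets for p in s)
    pair_counts = Counter(
        pair for s in page_sets for pair in combinations(sorted(s), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / single_counts[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    # Strongest rules first; ties are broken alphabetically.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Hypothetical sessions, each a list of pages seen in one visit.
SESSIONS = [
    ["home", "products", "cart"],
    ["home", "products"],
    ["home", "blog"],
    ["products", "cart"],
]

RULES = association_rules(SESSIONS)
for x, y, support, confidence in RULES:
    print(f"{x} => {y}  support={support:.2f}  confidence={confidence:.2f}")
```

In this sample every session that contains `cart` also contains `products`, so that rule comes out with the highest confidence.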
The goal of Web structure mining is to generate a structural summary of a Web
site and its Web pages. Technically, Web content mining mainly focuses on the structure within a
document, while Web structure mining tries to discover the link structure of the hyperlinks at the
inter-document level. Based on the topology of the hyperlinks, Web structure mining categorizes
Web pages and generates information such as the similarity and relationship between different Web
sites. Web structure mining can also take another direction: discovering the structure of
the Web document itself. This type of structure mining can be used to reveal the structure
(schema) of Web pages, which is useful for navigation and makes it possible to
compare and integrate Web page schemas. It also facilitates introducing
database techniques for accessing information in Web pages by providing a reference schema.
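One simple way to approximate the structure (schema) of a single Web page is to count the tag paths that occur in it. The sketch below does this with the Python standard library; the class `TagPathCollector`, the helper `page_schema`, the short list of void tags, and the sample markup are assumptions for this example and not an established schema format.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (a deliberately short list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagPathCollector(HTMLParser):
    """Counts how often each root-to-element tag path occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def page_schema(html):
    """Return a Counter mapping tag paths to their number of occurrences."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


# Hypothetical page used only for this example.
SAMPLE = "<html><body><ul><li>A</li><li>B</li></ul><p>Text<br>more</p></body></html>"

SCHEMA = page_schema(SAMPLE)
for path, count in sorted(SCHEMA.items()):
    print(path, count)
```

Two pages can then be compared by comparing their path counts, which is one possible reading of comparing Web page schemas.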
The detailed work focuses on how to discover interesting and informative facts
describing the connectivity in a Web subset, based on a given collection of interconnected
Web documents. The structural information generated by Web structure mining includes the
following: the frequency of local links in the Web tuples in a Web table; the frequency of
Web tuples in a Web table containing interior links, that is, links within the same document;
the frequency of Web tuples in a Web table containing global links, that is, links that span
different Web sites; and the frequency of identical Web tuples that appear in a Web table or
among Web tables. In general, if a Web page is linked to another Web page directly, or the Web pages
are neighbors, we would like to discover the relationships among those Web pages.
The relations may fall into one of several types: the pages may be related by synonyms or ontology,
they may have similar contents, or both may sit on the same Web server and therefore may have been
created by the same person.
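The link categories named above can be illustrated with a short classifier. This sketch assumes that an interior link stays within the same document, a local link points to another document on the same host, and a global link points to a different host; the function names and sample URLs are hypothetical, and the Web table and Web tuple model itself is not reproduced here.

```python
from collections import Counter
from urllib.parse import urldefrag, urljoin, urlsplit


def classify_link(page_url, href):
    """Label a link as 'interior', 'local', or 'global' relative to page_url."""
    target = urljoin(page_url, href)  # resolve relative links
    target_doc, _ = urldefrag(target)  # drop any #fragment
    page_doc, _ = urldefrag(page_url)
    if target_doc == page_doc:
        return "interior"
    if urlsplit(target).netloc.lower() == urlsplit(page_url).netloc.lower():
        return "local"
    return "global"


def link_frequencies(page_url, hrefs):
    """Count how many links of each kind a page contains."""
    return Counter(classify_link(page_url, h) for h in hrefs)


# Hypothetical page and links used only for this example.
PAGE_URL = "http://example.com/docs/index.html"
HREFS = ["#top", "../about.html", "guide.html", "http://other.example.org/x"]

FREQUENCIES = link_frequencies(PAGE_URL, HREFS)
print(FREQUENCIES)
```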
Another task of Web structure mining is to discover the nature of the hierarchy
or network of hyperlinks in the Web sites of a particular domain. This may help to generalize
the flow of information in Web sites that represent a particular domain, so that
query processing becomes easier and more efficient. Web structure mining has a natural relation
to Web content mining, since Web documents very likely contain links, and both use the real or
primary data on the Web. These two mining tasks are quite often combined in an
application.
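One hedged way to look at the hierarchy of hyperlinks in a site is to compute how many clicks each page is from the home page with a breadth-first search over the link graph. The function `link_depths` and the small graph below are assumptions for this example; the graph is taken as already extracted from the pages.

```python
from collections import deque


def link_depths(graph, root):
    """Return the minimum number of links needed to reach each page from root."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in graph.get(page, []):
            if target not in depths:  # first visit gives the shortest distance
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical site: each page maps to the pages it links to.
SITE = {
    "/": ["/products", "/about"],
    "/products": ["/products/1", "/"],
    "/products/1": ["/cart", "/"],
    "/about": [],
}

DEPTHS = link_depths(SITE, "/")
for page, depth in sorted(DEPTHS.items(), key=lambda item: item[1]):
    print(depth, page)
```

Pages at the same depth form one level of the hierarchy, and pages that never appear in the result are unreachable from the chosen root.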
Web mining is the application of data mining techniques to discover patterns
from the Web. According to the analysis target, Web mining can be divided into three
types: Web usage mining, Web content mining, and Web structure mining. Web usage
mining is the application of data mining to
analyze and discover interesting patterns in usage data on the Web in order to better
understand and serve the needs of users or Web-based applications. It is an activity that
involves the automatic discovery of patterns from one or more Web servers. Organizations
often generate and collect large volumes of data; most of this information is
generated automatically by Web servers and collected in server logs.
Early Web analysis tools simply provided mechanisms to report user activity as
recorded in the servers. Using such tools, it was possible to determine such information as
the number of accesses to the server, the times or time intervals of visits, and the
domain names and URLs of users of the Web server. However, in general, these tools provide
little or no analysis of data relationships among the accessed files and directories within
the Web space. More sophisticated techniques for discovery and analysis of patterns are
now emerging. These tools fall into two main categories: pattern discovery tools and pattern
analysis tools.
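The basic reporting described above (number of accesses, visit times, requesting hosts, and requested URLs) can be sketched from a server log as follows. The example assumes the log is in the Common Log Format, which the text does not specify; the pattern `LOG_PATTERN`, the function names, and the sample lines are invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# One Common Log Format line: host, identity, user, [time], "request", status, size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


def summarize(lines):
    """Report access count, hosts, URLs, and the first and last access time."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    times = [e["time"] for e in entries]
    return {
        "accesses": len(entries),
        "hosts": Counter(e["host"] for e in entries),
        "urls": Counter(e["url"] for e in entries),
        "first": min(times) if times else None,
        "last": max(times) if times else None,
    }


# Hypothetical log lines; the last one is deliberately malformed.
LOG_LINES = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.1 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 1045',
    '198.51.100.7 - - [10/Oct/2000:14:02:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
    "malformed line",
]

SUMMARY = summarize(LOG_LINES)
print(SUMMARY["accesses"], SUMMARY["hosts"], SUMMARY["urls"])
```

A report like this stops at counting. Pattern discovery would go further, for example by grouping entries into sessions and looking for relationships among the accessed files.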
Web content mining is the process of discovering useful information from the content of a web page.
Web content may consist of text, image, audio, or video data. Web content mining is sometimes called
web text mining, because text content is the most widely researched area. The technologies normally
used in web content mining are NLP (natural language processing) and IR (information retrieval).
Web structure mining is the process of using graph theory to
analyze the node and connection structure of a web site. In the past few years, there has been a
rapid expansion of activity in the Web content mining area. This is not surprising given
the phenomenal growth of Web content and the significant economic benefit of such mining.
However, due to the heterogeneity and lack of structure of Web data, automated discovery
of targeted or unexpected knowledge still presents many challenging research
problems.
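As one example of an IR technique applied to page text, the sketch below weights the terms of each document with TF-IDF, so that words that are frequent in one page but rare across the collection stand out. The tokenizer, the function `tf_idf`, and the three sample documents are assumptions for this example; a production system would add stop-word handling, stemming, and smoothing.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(N / documents containing term)
    """
    tokenized = [tokenize(d) for d in documents]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        weights.append(
            {t: (c / total) * math.log(n / doc_freq[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical page texts used only for this example.
DOCUMENTS = [
    "web mining finds patterns in web data",
    "usage mining analyzes server logs",
    "content mining extracts text from pages",
]

WEIGHTS = tf_idf(DOCUMENTS)
for term, weight in sorted(WEIGHTS[0].items(), key=lambda item: -item[1]):
    print(f"{term:10s} {weight:.3f}")
```

A term that appears in every document, such as "mining" here, gets a weight of zero, which reflects that it does not help to tell the pages apart.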