Course Project: Wikipedia Analysis
You will analyze the Wikipedia documents and generate interesting statistics about Wikipedia content and structure. The project will be done in two phases:
Phase 1: Data Pre-processing
The class will be split into groups; each group will be assigned a task and will derive output that will be shared among all students. The tasks are the following (rough parsing sketches for these tasks follow the list):
- 1. Identify pages that have infoboxes. You will scan the Wikipedia pages and generate a CSV file named infobox.csv where each line corresponds to a page that contains an infobox and has the following format:
page_id, infobox_text
If a page does not have an infobox, it will not have a line in the CSV file. The infobox_text should contain all text for the infobox, including the template name, attribute names and values.
- 2. Extract links from Wikipedia pages. You will scan the Wikipedia pages and generate a CSV file named links.csv where each line corresponds to an internal Wikipedia link, i.e., a link that points to a Wikipedia page. Each line in the file should have the following format:
page_id, url
- 3. Extract text from Wikipedia pages. You will scan the Wikipedia pages and generate a CSV file named content.csv where each line corresponds to a Wikipedia page and contains the page id and its content. The content should be pre-processed to remove the Wiki markup. Each line in the file should have the following format:
page_id, text
- 4. Extract metadata from Wikipedia pages. You will scan the Wikipedia pages and generate an XML file named metadata.xml where each element corresponds to a Wikipedia page. The metadata is represented in the file using XML markup. You can use the same schema as the original Wikipedia XML, e.g.:
<page>
<title>Albedo</title>
<ns>0</ns>
<id>39</id>
<revision>
<id>519638683</id>
<parentid>519635723</parentid>
<timestamp>2012-10-24T20:53:50Z</timestamp>
<contributor>
<username>Jenny2017</username>
<id>17023873</id>
</contributor>
</revision>
</page>
In addition, you will add information about the categories assigned to the page (in the Wiki markup these appear as [[Category:...]] links) as well as the cross-language links. Here's a sample of the cross-language links:
als:Albedo
ar:بياض
an:Albedo
ast:Albedu
bn:প্রতিফলন অনুপাত
bg:Албедо
bs:Albedo
ca:Albedo
cs:Albedo
cy:Albedo
da:Albedo
de:Albedo
et:Albeedo
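To make the pre-processing concrete, here is a minimal sketch for tasks 1-3 that makes a single pass over a standard pages-articles XML dump and writes the three CSV files. It assumes the mwparserfromhell library is available; the dump file name and the simple "template name starts with Infobox" test are assumptions, not requirements.

# Sketch only: one pass over a pages-articles XML dump that produces
# infobox.csv, links.csv, and content.csv (tasks 1-3).
import csv
import xml.etree.ElementTree as ET
import mwparserfromhell  # third-party: pip install mwparserfromhell

DUMP = "enwiki-pages-articles.xml"  # hypothetical input file

def local(tag):
    """Strip the XML namespace that MediaWiki puts on every element name."""
    return tag.rsplit("}", 1)[-1]

with open("infobox.csv", "w", newline="", encoding="utf-8") as f_box, \
     open("links.csv", "w", newline="", encoding="utf-8") as f_lnk, \
     open("content.csv", "w", newline="", encoding="utf-8") as f_txt:
    w_box, w_lnk, w_txt = csv.writer(f_box), csv.writer(f_lnk), csv.writer(f_txt)
    page_id, text = None, None
    for _, elem in ET.iterparse(DUMP, events=("end",)):
        tag = local(elem.tag)
        if tag == "id" and page_id is None:
            page_id = elem.text          # the first <id> inside <page> is the page id
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            code = mwparserfromhell.parse(text)
            # Task 1: any template whose name starts with "Infobox"
            for tpl in code.filter_templates():
                if str(tpl.name).strip().lower().startswith("infobox"):
                    w_box.writerow([page_id, str(tpl)])
            # Task 2: internal wiki links ([[Target]] or [[Target|label]])
            for link in code.filter_wikilinks():
                w_lnk.writerow([page_id, str(link.title)])
            # Task 3: article text with the Wiki markup stripped
            w_txt.writerow([page_id, code.strip_code()])
            page_id, text = None, None
            elem.clear()                 # keep memory bounded on a large dump

Note that for task 2 the sketch records link targets rather than full URLs; turning a target into a URL (and filtering out non-article namespaces) is left to the group doing that task.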
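For task 4, the categories and cross-language links can be pulled out of the wikitext before the metadata is written to metadata.xml. Here is a minimal sketch, again assuming mwparserfromhell; the 2-3 letter language-prefix test is a rough assumption and should be checked against the actual list of language codes.

# Sketch only: extracting categories and cross-language links from one page's
# wikitext (task 4).
import re
import mwparserfromhell

def categories_and_langlinks(wikitext):
    code = mwparserfromhell.parse(wikitext)
    categories, langlinks = [], []
    for link in code.filter_wikilinks():
        title = str(link.title).strip()
        if title.lower().startswith("category:"):
            categories.append(title.split(":", 1)[1])
        elif re.match(r"[a-z]{2,3}(-[a-z]+)?:", title):   # e.g. "de:Albedo"
            langlinks.append(title)
    return categories, langlinks

The two lists can then be serialized as extra child elements of each <page> element in metadata.xml.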
Phase 2: Data Analysis
Given the data products derived in Phase 1, all groups will (sketches for a few of these analyses follow the list):
- Count the number of pages that have an infobox, and the number of infoboxes that do not use a template
- Group pages according to the kind of infobox they contain
- Count the number of tables in all the Wikipedia pages
- Build a histogram that shows the distribution of cross-language links across different languages, i.e., for each language show the total number of links for that language
- Build a histogram that shows the recency of the pages, i.e., the histogram should show how many pages were modified on a particular date. You can use the date of the last change to each page.
- Group articles based on their recency and display a tag cloud for each group. You can decide on the granularity of the groups based on the recency histogram.
- Compute the PageRank of each Wikipedia page and display the 100 pages with the highest PageRank
- Build a histogram that shows the distribution of pages across categories, i.e., the number of pages in each category
- Some pages are assigned to multiple categories. Build a histogram that shows the number of pages that have 1, 2, 3, ..., n categories.
- Compute the word co-occurrence matrix (i.e., the number of times word w_i occurs with word w_j within the same article)
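As a starting point for the counting tasks, here is a minimal sketch of two of the histograms, computed with collections.Counter. It assumes the Phase 1 metadata has been flattened into two helper CSVs, langlinks.csv (page_id, lang) and categories.csv (page_id, category); those file names are assumptions, not part of the assignment.

# Sketch only: two of the histogram tasks, computed with collections.Counter.
# langlinks.csv and categories.csv are assumed helper files (see above).
import csv
from collections import Counter

# Distribution of cross-language links across languages
lang_counts = Counter()
with open("langlinks.csv", newline="", encoding="utf-8") as f:
    for page_id, lang in csv.reader(f):
        lang_counts[lang] += 1

# How many pages have 1, 2, 3, ..., n categories
categories_per_page = Counter()
with open("categories.csv", newline="", encoding="utf-8") as f:
    for page_id, category in csv.reader(f):
        categories_per_page[page_id] += 1
category_histogram = Counter(categories_per_page.values())

print(lang_counts.most_common(10))
print(sorted(category_histogram.items()))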
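For the PageRank task, a plain power iteration over the link graph is enough; a graph library such as networkx would work just as well. The sketch assumes the links from links.csv have already been resolved to (source_id, target_id) pairs stored in a hypothetical edges.csv; that resolution step is not shown.

# Sketch only: PageRank by power iteration.  edges.csv (source_id, target_id)
# is a hypothetical file derived from links.csv.
import csv
from collections import defaultdict

DAMPING, ITERATIONS = 0.85, 30

out_links = defaultdict(list)
nodes = set()
with open("edges.csv", newline="", encoding="utf-8") as f:
    for src, dst in csv.reader(f):
        out_links[src].append(dst)
        nodes.update((src, dst))

n = len(nodes)
rank = {p: 1.0 / n for p in nodes}
for _ in range(ITERATIONS):
    new_rank = {p: (1.0 - DAMPING) / n for p in nodes}
    for src, targets in out_links.items():
        share = DAMPING * rank[src] / len(targets)
        for dst in targets:
            new_rank[dst] += share
    # Spread the rank of dangling pages (no outgoing links) evenly
    dangling = DAMPING * sum(rank[p] for p in nodes if p not in out_links)
    for p in nodes:
        new_rank[p] += dangling / n
    rank = new_rank

top_100 = sorted(rank.items(), key=lambda kv: kv[1], reverse=True)[:100]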
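For the co-occurrence matrix, one reading of the task is that w_i and w_j co-occur when they both appear in the same article; the sketch below counts exactly that. In practice you would restrict the vocabulary (e.g., to the k most frequent words), since the number of pairs per article grows quadratically.

# Sketch only: article-level word co-occurrence counts read from content.csv.
import csv
import itertools
import re
from collections import Counter

csv.field_size_limit(10**7)        # article text fields can be very long

cooccurrence = Counter()
with open("content.csv", newline="", encoding="utf-8") as f:
    for page_id, text in csv.reader(f):
        # A real run should restrict this set; here every word is kept.
        words = set(re.findall(r"[a-z]+", text.lower()))
        for w1, w2 in itertools.combinations(sorted(words), 2):
            cooccurrence[(w1, w2)] += 1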
There are different ways to present the results of some of the analysis tasks. You are free to use your creativity to come up with visualizations that make it easy to convey the results.
You can also come up with your own questions, analyses, and visualization ideas. Your colleagues will vote to select the most interesting ideas, and the prize will be extra credit!