Course Project: Wikipedia Analysis

You will analyze the Wikipedia documents and generate ''interesting'' statistics about Wikipedia content and structure. The project will be done in two phases:


== Phase 1: Data pre-processing --- Due November 22nd ==
 
''Note: You should be prepared to present your phase 1 results on Monday, Nov 26th. [http://www.vistrails.org/index.php/Course_Project:_Wikipedia_Analysis#Submitting_Code_and_Results See guidelines for presentation below.]''


The class will be split into different groups; each group will be assigned a task and derive output that will be shared among all students. The tasks are the following:
* 1. ''Identify pages that have infoboxes.'' You will scan the Wikipedia pages and generate a CSV file named '''infobox.csv''' with the following format for each line corresponding to a page that contains an infobox: <code>page_id, infobox_text</code>
If a page does not have an infobox, it will not have a line in the CSV file. The infobox_text should contain all text for the infobox, including the template name, attribute names and values.  
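
As a starting point (this is only a sketch, not the required implementation), the following Python fragment shows one way to detect an infobox in raw wikitext by matching balanced template braces and to emit one CSV row per page; the function names and the brace-matching strategy are illustrative choices only.
<code>
# Hypothetical sketch: detect an infobox in raw wikitext and emit one CSV row.
# Assumes the page text is already available as a string; real runs would
# iterate over the dump (e.g., inside a Hadoop streaming mapper).
import csv
import re
import sys

def extract_infobox(wikitext):
    """Return the full '{{Infobox ...}}' text, or None if the page has no infobox."""
    match = re.search(r'\{\{\s*Infobox', wikitext, re.IGNORECASE)
    if not match:
        return None
    # Walk forward counting braces so nested templates stay inside the capture.
    depth, i = 0, match.start()
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == '{{':
            depth += 1
            i += 2
        elif pair == '}}':
            depth -= 1
            i += 2
            if depth == 0:
                return wikitext[match.start():i]
        else:
            i += 1
    return None  # unbalanced braces: treat as no usable infobox

def emit_row(page_title, wikitext, writer):
    infobox = extract_infobox(wikitext)
    if infobox is not None:
        # The page id is the title; newlines are flattened so each page is one CSV line.
        writer.writerow([page_title, ' '.join(infobox.split())])

if __name__ == '__main__':
    writer = csv.writer(sys.stdout)
    sample = "'''Albedo''' ...\n{{Infobox physical quantity\n| name = Albedo\n}}\n..."
    emit_row('Albedo', sample, writer)
</code>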


''Note that the id for a Wikipedia page is its title.''
 
* 2. ''Extract links from Wikipedia pages.'' You will scan the Wikipedia pages and generate a CSV file named links.csv where each line corresponds to an internal Wikipedia link, i.e., a link that points to a Wikipedia page. Each line in the file should have the following format:  <code>page_id_source, page_id_target</code>
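
One possible way to pull these links out of the markup (a sketch only; the regular expression and the namespace filter below are assumptions, not requirements) is shown here. See also the in-class note under "Some notes" about which link kinds to keep.
<code>
# Hypothetical sketch: extract internal article-to-article links from wikitext.
# Each output row is "page_id_source, page_id_target"; page ids are titles.
import csv
import re
import sys

WIKILINK = re.compile(r'\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]')

# Namespace prefixes that do not point to article content.
NON_CONTENT_PREFIXES = ('File:', 'Image:', 'Category:', 'Wikipedia:', 'Template:',
                        'Help:', 'User:', 'Talk:', 'Special:', 'Portal:')

def extract_links(source_title, wikitext):
    """Yield (source, target) pairs for internal links that point to articles."""
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if not target or target.startswith(NON_CONTENT_PREFIXES):
            continue
        if re.match(r'^[a-z]{2,3}(-[a-z]+)?:', target):
            continue  # cross-language link such as [[de:Albedo]]
        yield source_title, target

if __name__ == '__main__':
    writer = csv.writer(sys.stdout)
    text = "See [[Reflectance]] and [[solar irradiance|irradiance]]. [[de:Albedo]] [[Category:Climatology]]"
    for row in extract_links('Albedo', text):
        writer.writerow(row)
</code>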
 


* 3. Extract text from Wikipedia pages. You will scan the Wikipedia pages and generate a CSV file named '''content.csv'''  where each line corresponds to a Wikipedia page and contains the page id and its content. The content should be pre-processed to remove the Wiki markup. Each line in the file should have the following format:  <code>page_id, text</code>
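
Stripping the markup with regular expressions is crude but illustrates the transformation; the sketch below is one possible starting point, and a real parser (see the Sweble note under "Some notes") will handle the many cases it misses.
<code>
# Hypothetical sketch: strip the most common Wiki markup so content.csv holds plain text.
import re

def strip_markup(wikitext):
    text = wikitext
    text = re.sub(r'<ref[^>/]*/>', ' ', text)                          # empty <ref .../> tags
    text = re.sub(r'<ref[^>]*>.*?</ref>', ' ', text, flags=re.DOTALL)  # references
    text = re.sub(r'\{\{[^{}]*\}\}', ' ', text)                        # innermost templates
    text = re.sub(r'\{\{[^{}]*\}\}', ' ', text)                        # one more pass for nesting
    text = re.sub(r'\[\[[^\[\]|]*\|([^\[\]]*)\]\]', r'\1', text)       # piped links -> label
    text = re.sub(r'\[\[([^\[\]]*)\]\]', r'\1', text)                  # plain links -> target
    text = re.sub(r"'{2,}", '', text)                                  # bold/italic quotes
    text = re.sub(r'<[^>]+>', ' ', text)                               # remaining HTML tags
    text = re.sub(r'==+([^=]*)==+', r'\1', text)                       # section headings
    return ' '.join(text.split())                                      # collapse whitespace

if __name__ == '__main__':
    sample = ("'''Albedo''' ({{IPAc-en|ae|l|b|i|d|o}}) is the fraction of "
              "[[sunlight]] that is [[diffuse reflection|diffusely reflected]]."
              "<ref>Coakley, J. A. (2003)</ref>")
    print(strip_markup(sample))  # -> "Albedo ( ) is the fraction of sunlight that is diffusely reflected."
</code>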


* 4. Extract metadata from Wikipedia pages. You will scan the Wikipedia pages and generate an XML file named '''metadata.xml''' where each element corresponds to a Wikipedia page. The metadata is represented in the file using XML markup. You can use the same schema as the Wikipedia dump, e.g.,
<code>
  <page>
    <title>Albedo</title>
    <ns>0</ns>
    <id>39</id>
    <revision>
      <id>519638683</id>
      <parentid>519635723</parentid>
      <timestamp>2012-10-24T20:53:50Z</timestamp>
      <contributor>
        <username>Jenny2017</username>
        <id>17023873</id>
      </contributor>
    </revision>
  </page>
</code>


In addition, you will extend the XML schema to include information about the categories assigned to the page as well as the cross-language links. Here's a sample of the categories:
<code><nowiki>
[[Category:Climate forcing]]
[[Category:Climatology]]
[[Category:Electromagnetic radiation]]
[[Category:Radiometry]]
[[Category:Scattering, absorption and radiative transfer (optics)]]
</nowiki></code>
which should be represented in XML as follows:

<code>
<category> Climate forcing </category>
<category> Climatology </category>
<category> Electromagnetic radiation </category>
<category> Radiometry </category>
<category> Scattering, absorption and radiative transfer (optics) </category>
</code>


Here's a sample of the cross-language links:
<code><nowiki>
[[als:Albedo]]
[[ar:بياض]]
[[an:Albedo]]
[[ast:Albedu]]
[[bn:প্রতিফলন অনুপাত]]
[[bg:Албедо]]
</nowiki></code>
which should be represented in XML as follows:

<code>
<crosslanguage> als </crosslanguage>
<crosslanguage> ar </crosslanguage>
<crosslanguage> an </crosslanguage>
<crosslanguage> ast </crosslanguage>
<crosslanguage> bn </crosslanguage>
<crosslanguage> bg </crosslanguage>
</code>


 
For example, the XML element for the page about Albedo should have the following format:
 
<code> 
  <page>
    <title>Albedo</title>
    <ns>0</ns>
    <id>39</id>
    <revision>
      <id>519638683</id>
      <parentid>519635723</parentid>
      <timestamp>2012-10-24T20:53:50Z</timestamp>
      <contributor>
        <username>Jenny2017</username>
        <id>17023873</id>
      </contributor>
      <category> Climate forcing </category>
      <category> Climatology </category>
      <category> Electromagnetic radiation </category>
      <category> Radiometry </category>
      <category> Scattering, absorption and radiative transfer (optics) </category>
      <crosslanguage> als </crosslanguage>
      <crosslanguage> ar </crosslanguage>
      <crosslanguage> an </crosslanguage>
      <crosslanguage> ast </crosslanguage>
      <crosslanguage> bn </crosslanguage>
      <crosslanguage> bg </crosslanguage>
    </revision>
  </page>
</code>
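
As an illustration of how the extra elements could be produced (a sketch with made-up helper names, not a prescribed approach), the category and cross-language lines can be pulled out of the wikitext and turned into the elements shown above:
<code>
# Hypothetical sketch: turn [[Category:...]] and [[xx:Title]] lines from the wikitext
# into the extra <category> and <crosslanguage> elements shown above.
import re
from xml.sax.saxutils import escape

CATEGORY = re.compile(r'\[\[Category:([^\[\]|]+)(?:\|[^\[\]]*)?\]\]')
CROSSLANG = re.compile(r'\[\[([a-z]{2,3}(?:-[a-z]+)?):[^\[\]]+\]\]')

def metadata_elements(wikitext):
    """Return the extra metadata lines, ready to be spliced into the <revision> element."""
    lines = []
    for name in CATEGORY.findall(wikitext):
        lines.append('      <category> %s </category>' % escape(name.strip()))
    for lang in CROSSLANG.findall(wikitext):
        lines.append('      <crosslanguage> %s </crosslanguage>' % lang)
    return lines

if __name__ == '__main__':
    sample = ("... article text ...\n"
              "[[Category:Climate forcing]]\n[[Category:Climatology]]\n"
              "[[als:Albedo]]\n[[bg:Албедо]]\n")
    print('\n'.join(metadata_elements(sample)))
</code>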
 
=== Task/Group Assignment ===
==== '''Task 1:''' ====
*Diǎosī: https://docs.google.com/document/d/1U2SgYDVU4XRBmZrGijdF_AaJ7C7yUr5Gu3H8HOAq9UY/edit (Zhe Zhou conanchou1412@gmail.com)
** Project members: Zhe Zhou (conanchou1412@gmail.com), Kan Xiao (canly.xiao@gmail.com, 0464716)
** Output: https://s3.amazonaws.com/diaosi-mapreduce/hadoop_data_final/part-00000
 
==== '''Task 2: ''' ====
* Titans: https://docs.google.com/open?id=0B3e0J2n8r2MZaERFQjRtMU5FQUE
** Project members: Parth Trivedi (tprive02@students.poly.edu)  Mihir Patel (mpatel21@students.poly.edu)  Sunil Gunisetty (sgunis01@students.poly.edu)
** Output: s3n://mpatel21/wikisearch/result/task2
* Avishkar Nikum: https://docs.google.com/document/d/1h9_Giapd0gsZ2iTZ7W0ZJK1WBI9oBB6PJ2C1xHvlylQ/edit
** Project members: Avishkar Nikum  (anikum01@students.poly.edu) Akshat Amritkar  (a_akshat@yahoo.com)  Zijian Wang  (zwang17@students.poly.edu)
** Output: s3n://big.data.out/outs/
 
==== '''Task 3:''' ====
* Team Indexers: https://docs.google.com/folder/d/0B5t0jWjSi7neSjVsOENiTW5Oa3M/edit
** Project members: Juan Rodriguez (jrodri04@students.poly.edu) Qi Wang (edwinj619@gmail.com) Ching-Che Chang (cchang08@students.poly.edu)
** Output:  '''single file:'''  s3n://indexers/output/All/part-00000.bz2 '''multiple files:''' s3n://indexers/output/All/task3-*.bz2
 
* Viren Sawant (group without name): https://docs.google.com/document/d/1t3i5dCa37kuCrMXjI1-uSnqlYlKp-NmRqbLf--yHLmA/edit
** Project members: Aman Luthra  (aluthr01@students.poly.edu)  Viren Sawant (vsawan01@students.poly.edu)  Ankit Kapadia (akapad01@students.poly.edu)
** Output: S3n://task3group2output/output
 
==== '''Task 4: ''' ====
*Team-kepaassafaidladkahai: did not complete
*A.F.K: https://docs.google.com/document/d/10Sx7VJF0rNveeD10aTOvJlLoE2gjfoIobVxGnmpM_3c/edit (Kien Pham kienpt.vie@gmail.com; Tuan-Anh Hoang-Vu hvtuananh@gmail.com; Fernando Seabra Chirigati fernando.chirigati@gmail.com)
** '''Output:'''
'''(1433 output files--with xxxx from 0000 to 1432)''' s3n://cs9223-results/2012-11-21-074821
https://s3.amazonaws.com/cs9223-results/2012-11-21-074821/part-0xxxx
 
'''(296 output files---with xxx from 000 to 295)'''
s3n://wikimetadata/2012-11-25-222927
https://s3.amazonaws.com/wikimetadata/2012-11-25-222927/part-00xxx
 
=== Some notes ===
* Start by writing and testing the code on your own machine using a small data sample. Once that is working, you can move to AWS.
* We have placed the Wikipedia data set on S3. The data is split into 27 files. See the instructions on how to access it: [[Accessing Wikipedia data on S3]]
* Here are instructions on how to work with AWS (we will distribute the AWS tokens next class): [[AWS Setup]]
* As you work on your task, take note of the issues you encounter so that you can discuss them during your presentation. You should also discuss the rationale for your design, in particular, how you designed your experiment and selected the number of machines and their configuration.
 
* '''2012-11-19''': Summary of in-class discussions
** What is the right ID for a document?
*** We will use the document title as the ID.
** Which links should be extracted for task 2?
*** The output should contain ''at least'' the links from/to real documents. You can also include links to users, categories, etc. If you do so, you should distinguish between the different kinds of links so that it is possible to disregard all non-content links in the page rank computation (a small labeling sketch follows these notes).
** How to parse the Wiki markup?
*** One of the groups had trouble with publicly available parsers. Here's the parser they are currently using, which seems to work properly: Sweble -- http://sweble.org
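
If your group decides to keep non-content links, one way to label them so they can be filtered out later is a small classifier over the link target (a sketch only; the label names are made up for the example):
<code>
# Hypothetical sketch: label a link target so non-content links can be kept in
# links.csv but ignored later in the page rank computation. The labels are made up.
import re

def link_kind(target):
    """Classify a wiki link target as 'article', 'category', 'language', or 'other'."""
    if target.startswith('Category:'):
        return 'category'
    if re.match(r'^[a-z]{2,3}(-[a-z]+)?:', target):
        return 'language'
    if re.match(r'^(File|Image|User|Talk|Wikipedia|Template|Help|Portal|Special):', target):
        return 'other'
    return 'article'

if __name__ == '__main__':
    for t in ['Reflectance', 'Category:Climatology', 'de:Albedo', 'File:Example.jpg']:
        print(t, '->', link_kind(t))
</code>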
 
== Phase 2: Data Analysis --- Due December 17th before class! ==


Given the data products derived in Phase 1, all groups will:

* Count the number of pages that have an infobox, and the number of infoboxes that do not use a template
* Group pages according to the kind of infobox they contain
* Build a histogram that shows the distribution of cross-language links across different languages, i.e., for each language show the total number of links for that language
* Build a histogram that shows the recency of the pages, i.e., the histogram should show how many pages have been modified on a particular date. You can use the last date of change for the pages.
* Group articles based on their recency and display a tag cloud for each group. You can decide on the granularity of the groups based on the recency histogram.
* Compute page rank for each Wikipedia page and display the top 100 pages with the highest page rank (a minimal sketch follows this list)
* Build a histogram that shows the distribution of pages across categories, i.e., the number of pages in each category
* Some pages are assigned to multiple categories. Build a histogram that shows the number of pages that have 1, 2, 3, ..., n categories.
* Compute the word co-occurrence matrix, i.e., the number of times word w_i occurs with word w_j within an article (a counting sketch also follows this list)
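
For the page rank task, the sketch below runs a plain in-memory power iteration over the (source, target) pairs from links.csv; on the full data set you would implement the same iteration as a MapReduce job, and the 0.85 damping factor is just the usual default, not a project requirement.
<code>
# Hypothetical sketch: in-memory power-iteration page rank over (source, target) pairs
# from links.csv (task 2 output). Not the required implementation.
import csv
from collections import defaultdict

def page_rank(edges, damping=0.85, iterations=30):
    out_links = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        out_links[src].append(dst)
        nodes.update((src, dst))
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        for n in nodes:
            for m in out_links[n]:
                new_rank[m] += damping * rank[n] / len(out_links[n])
        for n in nodes:
            new_rank[n] += damping * dangling / len(nodes)  # spread dangling mass evenly
        rank = new_rank
    return rank

if __name__ == '__main__':
    with open('links.csv') as f:  # file name from task 2
        ranks = page_rank(tuple(row[:2]) for row in csv.reader(f))
    for title, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)[:100]:
        print('%s\t%.6f' % (title, score))
</code>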
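
For the co-occurrence task, the sketch below counts, per article, each unordered pair of distinct words from the cleaned text in content.csv; the tokenization is deliberately naive, and counting pairs per article is quadratic in the article vocabulary, so on the full dump the same pair-emitting idea would run as a MapReduce job.
<code>
# Hypothetical sketch: count how often word w_i occurs with word w_j within the same
# article, using the cleaned text from content.csv (task 3 output).
import csv
import itertools
import re
from collections import Counter

def cooccurrence(rows):
    counts = Counter()
    for page_id, text in rows:
        words = sorted(set(re.findall(r'[a-z]+', text.lower())))
        counts.update(itertools.combinations(words, 2))  # each unordered pair once per article
    return counts

if __name__ == '__main__':
    with open('content.csv') as f:
        matrix = cooccurrence((row[0], row[1]) for row in csv.reader(f))
    for (w1, w2), n in matrix.most_common(20):
        print('%s\t%s\t%d' % (w1, w2, n))
</code>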


You don't have to do:
* Count the number of tables in all the Wikipedia pages
 
Each group will hand in a report with the results of the analysis tasks. There are different ways to present these results. You are free to use your creativity to come up with visualizations that make it easier to convey the results.  


You can also come up with your own questions, analysis and visualization ideas. Your colleagues will vote to select the most interesting ideas and the prize will be extra credit!
== Submitting Code and Results ==
* For Phase I, you should email me a link to a Google Doc with the following information:
** Name of your task, group name, name and email of the participants and what each member did
** Link to your code on GitHub
** Instructions on how to compile the code, and how to run it on AWS
** Description of the process you used to configure AWS, e.g., how did you select the server size? How did you select the number of nodes?
** A discussion of the performance of your code, e.g., how long did it take to run? Which optimizations did you use, if any?
** Discuss any experiences you think will be useful to share with the class
* For Phase II, you should add the results to the Google Doc you submitted in Phase I. You should also add your code to GitHub -- you can use the same repository as in Phase I.
* You will present your findings during class on Nov 26th. You will have 10-15 minutes for your presentation.
== Some Useful Links (suggested by your colleagues) ==
* http://wikipedia-miner.cms.waikato.ac.nz/
* https://github.com/whym/wikihadoop
* https://github.com/ogrisel/pignlproc/wiki/Splitting-a-Wikipedia-XML-dump-using-Mahout  -- WiKipediaXMLSplitter
== Acknowledgment ==
We thank Amazon for the AWS in Education Coursework grant.
