Accessing Wikipedia data on S3
HTTP links to the 27 files:
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current1.xml-p000000010p000010000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current2.xml-p000010001p000025000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current3.xml-p000025001p000055000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current4.xml-p000055002p000104998.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current5.xml-p000105001p000184999.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current6.xml-p000185003p000305000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current7.xml-p000305002p000464997.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current8.xml-p000465001p000665000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current9.xml-p000665001p000925000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current10.xml-p000925001p001325000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current11.xml-p001325001p001825000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current12.xml-p001825001p002425000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current13.xml-p002425001p003124998.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current14.xml-p003125001p003924999.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current15.xml-p003925001p004825000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current16.xml-p004825002p006025000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current17.xml-p006025001p007524997.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current18.xml-p007525002p009225000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current19.xml-p009225001p011125000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current20.xml-p011125001p013324998.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current21.xml-p013325001p015725000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current22.xml-p015725003p018225000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current23.xml-p018225001p020925000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current24.xml-p020925002p023725000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current25.xml-p023725001p026625000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current26.xml-p026625002p029625000.bz2
https://s3.amazonaws.com/cs9223/enwiki-20121001/enwiki-20121001-pages-meta-current27.xml-p029625001p037187679.bz2
Latest revision as of 19:19, 12 November 2012
There are 2 ways to access the Wikipedia segments:
- By HTTP. The links to the 27 files are listed above.
- Through Hadoop on EC2. Hadoop can read from S3 directly, so anyone with an access key and secret key configured in core-site.xml can access the files. For example:
bin/hadoop fs -ls s3n://cs9223/enwiki-20121001/
(s3n is the S3 native filesystem)
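For the HTTP route, a segment does not have to be saved to disk before it is inspected. The following is a minimal Python sketch, not part of the original page, that streams one of the .bz2 files and decompresses it on the fly. The helper names (iter_decompressed, read_chunks, head_of_segment) are hypothetical, and the sketch assumes the cs9223 bucket is still publicly readable over HTTPS.

```python
import bz2
import urllib.request

# URL of the first segment, taken from the list of 27 files.
BASE_URL = "https://s3.amazonaws.com/cs9223/enwiki-20121001/"
FIRST_SEGMENT = "enwiki-20121001-pages-meta-current1.xml-p000000010p000010000.bz2"


def read_chunks(stream, size=1 << 20):
    """Yield chunks of at most `size` bytes from a file-like object."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            return
        yield chunk


def iter_decompressed(chunks):
    """Yield decompressed bytes from an iterable of bz2-compressed chunks.

    A file may contain several concatenated bz2 streams, so a fresh
    decompressor is started whenever the current stream ends.
    """
    decompressor = bz2.BZ2Decompressor()
    for chunk in chunks:
        while chunk:
            data = decompressor.decompress(chunk)
            if data:
                yield data
            if decompressor.eof:
                # Bytes after the end of this stream belong to the next one.
                chunk = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
            else:
                chunk = b""


def head_of_segment(url, max_bytes=2000):
    """Return the first max_bytes of decompressed XML from a segment URL."""
    out = b""
    with urllib.request.urlopen(url) as response:
        for data in iter_decompressed(read_chunks(response)):
            out += data
            if len(out) >= max_bytes:
                break
    return out[:max_bytes]

# Example (needs network access):
# print(head_of_segment(BASE_URL + FIRST_SEGMENT).decode("utf-8", "replace"))
```

Stopping after max_bytes means only the start of the file is transferred, which is handy for checking the XML layout before committing to a full download.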
To access these files from any machine with Hadoop installed, open core-site.xml and add the following inside the <configuration> element, replacing ID and SECRET with your AWS access key ID and secret access key:
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>
Note: The property names must start with fs.s3n, the prefix for the S3 native file system.
See http://wiki.apache.org/hadoop/AmazonS3 for more information.
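A missing or misspelled property name is an easy mistake to make here, so it may help to check the file before running a job. The sketch below is an illustration only, not a Hadoop tool: it parses the text of a core-site.xml with Python's standard library and reports which of the two fs.s3n properties are absent or empty. The function names read_properties and missing_s3n_settings are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two properties named in the instructions above.
REQUIRED = ("fs.s3n.awsAccessKeyId", "fs.s3n.awsSecretAccessKey")


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> in a Hadoop config document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_s3n_settings(xml_text):
    """List the required fs.s3n properties that are absent or have an empty value."""
    props = read_properties(xml_text)
    return [name for name in REQUIRED if not props.get(name)]
```

For example, passing the contents of core-site.xml to missing_s3n_settings should return an empty list once both properties are filled in.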