Accessing Wikipedia data on S3
Latest revision as of 19:19, 12 November 2012
There are two ways to access the Wikipedia segments:
- By HTTP. Here are the links to the 27 files:
- Through Hadoop on EC2. Hadoop supports direct access to S3, so anyone with an access key and secret key configured in core-site.xml can access the data. For example:
bin/hadoop fs -ls s3n://cs9223/enwiki-20121001/
(s3n is the S3 native filesystem.)
To access these files from any machine with Hadoop installed, open core-site.xml and add the following:
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>SECRET</value>
</property>
Note: The property names must use the fs.s3n prefix, which selects the S3 native file system.
See http://wiki.apache.org/hadoop/AmazonS3 for more information.
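If you would rather script the core-site.xml change than edit the file by hand, the following is a minimal sketch in Python. It assumes the file has the usual `<configuration>` root element. The function name `add_s3n_credentials` is hypothetical and not part of Hadoop, and ID and SECRET are the same placeholders used above.

```python
import xml.etree.ElementTree as ET


def add_s3n_credentials(core_site_xml: str, access_key: str, secret_key: str) -> str:
    """Return core-site.xml text with the two fs.s3n credential properties set."""
    # Assumption: the root element is <configuration>, holding <property> children.
    root = ET.fromstring(core_site_xml)
    wanted = {
        "fs.s3n.awsAccessKeyId": access_key,
        "fs.s3n.awsSecretAccessKey": secret_key,
    }
    # Overwrite the value of any credential property that is already present.
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name in wanted:
            value = prop.find("value")
            if value is None:
                value = ET.SubElement(prop, "value")
            value.text = wanted.pop(name)
    # Append whichever credential properties were missing.
    for name, value_text in wanted.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value_text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Replace the placeholders with your own AWS access key and secret key.
    print(add_s3n_credentials("<configuration></configuration>", "ID", "SECRET"))
```

The function works on text instead of a file path, so you can inspect the result before overwriting anything. Note that ElementTree drops XML comments and the XML declaration when it parses, so keep a backup of the original file if it contains either.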