Reading the Contents of a Text File in HDFS with Hadoop 2.4

2014-11-18

The Hadoop version is 2.4.1, installed on Linux Mint 16.

On my Desktop there is a file t1.txt with the following content:

Sign up for GitHub. By clicking "Sign up for GitHub", you agree to our terms of service and privacy policy. We will send you account related emails occasionally

The file /input/t1.txt in HDFS has the same content.

Create the project and add the JARs


In Eclipse, create a project named LearnHDFS and add the following to its build path: hadoop-2.4.1/share/hadoop/common/hadoop-common-2.4.1.jar, hadoop-2.4.1/share/hadoop/hdfs/hadoop-hdfs-2.4.1.jar, and all the JARs under hadoop-2.4.1/share/hadoop/common/lib/.

Configure log4j


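Add a file log4j.properties to the LearnHDFS project (log4j looks it up on the classpath, so in Eclipse the src folder is a typical place) with the following content: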

# Log INFO and above to the console appender named stdout.
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
# The logfile appender is defined here but not listed in rootLogger above,
# so nothing is written to the file unless it is added there.
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
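
In the configuration above, rootLogger lists only stdout, so the logfile appender is defined but never used. A small check like the following can spot that kind of mismatch. It is only a sketch that uses java.util.Properties from the JDK, not log4j itself; the class name Log4jAppenderCheck and the method unattachedAppenders are made up for this illustration, and it looks at rootLogger only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

class Log4jAppenderCheck {

    // Returns the appender names that are defined (log4j.appender.NAME=...)
    // but not listed in log4j.rootLogger. Other loggers are ignored in this sketch.
    static Set<String> unattachedAppenders(Properties props) {
        Set<String> attached = new LinkedHashSet<>();
        String[] parts = props.getProperty("log4j.rootLogger", "").split(",");
        // parts[0] is the level (for example INFO); the rest are appender names.
        for (int i = 1; i < parts.length; i++) {
            attached.add(parts[i].trim());
        }
        Set<String> unattached = new LinkedHashSet<>();
        String prefix = "log4j.appender.";
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                String name = key.substring(prefix.length());
                // Keys with a further dot are options of an appender, not definitions.
                if (!name.contains(".") && !attached.contains(name)) {
                    unattached.add(name);
                }
            }
        }
        return unattached;
    }

    public static void main(String[] args) throws IOException {
        String config =
                "log4j.rootLogger=INFO, stdout\n"
              + "log4j.appender.stdout=org.apache.log4j.ConsoleAppender\n"
              + "log4j.appender.logfile=org.apache.log4j.FileAppender\n"
              + "log4j.appender.logfile.File=target/spring.log\n";
        Properties props = new Properties();
        props.load(new StringReader(config));
        System.out.println("Defined but unused appenders: " + unattachedAppenders(props));
    }
}
```

To actually write to the file, the appender name would have to be added to the rootLogger line (INFO, stdout, logfile).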

Read a local file


Add a file ReadLocalFile.java to the LearnHDFS project with the following content:

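import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadLocalFile {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path file = new Path("file:///home/sunlt/Desktop/t1.txt");
        // Resolve the file system from the path's own scheme (file://), so the
        // code does not depend on the default file system in the configuration.
        // try-with-resources closes the reader and then the file system.
        try (FileSystem fs = file.getFileSystem(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}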

Running ReadLocalFile.java directly from Eclipse prints:

Sign up for GitHub. By clicking "Sign up for GitHub", you agree to our terms of service and privacy policy. We will send you account related emails occasionally
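
The read loop above does not depend on Hadoop itself: FSDataInputStream is an ordinary java.io.InputStream, so the same pattern works on any stream. The sketch below isolates that loop in a hypothetical helper, readLines (not part of the original program), and feeds it an in-memory stream in place of fs.open(file). It assumes the text is UTF-8 and passes the charset explicitly so that the result does not depend on the platform default.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class StreamLines {

    // Hypothetical helper: reads every line from the stream, decoding it as UTF-8.
    // The stream is closed when reading finishes or fails.
    static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for fs.open(file) here.
        byte[] data = "Sign up for GitHub.\nWe will send you account related emails occasionally\n"
                .getBytes(StandardCharsets.UTF_8);
        for (String line : readLines(new ByteArrayInputStream(data))) {
            System.out.println(line);
        }
    }
}
```

With the Hadoop JARs on the build path, the same helper could be called as readLines(fs.open(file)).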

Read a file in HDFS


Add a file ReadHDFSFile.java to the LearnHDFS project with the following content:

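import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

public class ReadHDFSFile {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path file = new Path("/input/t1.txt");
        // try-with-resources closes the reader and then the file system,
        // even if reading fails part-way. The file is assumed to be UTF-8 text.
        try (FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException | URISyntaxException e) {
            e.printStackTrace();
        }
    }

}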

Because Hadoop's core-site.xml contains:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

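the first argument of FileSystem.get() is set to new URI("hdfs://localhost:9000"). (In Hadoop 2.x, fs.default.name is the deprecated name of fs.defaultFS; the old key is still accepted.)

Running ReadHDFSFile.java directly from Eclipse prints: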

Sign up for GitHub. By clicking "Sign up for GitHub", you agree to our terms of service and privacy policy. We will send you account related emails occasionally
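
To see what FileSystem.get() is given, it can help to take the URI apart. The sketch below uses only java.net.URI from the JDK (no Hadoop classes) to show the scheme, host, and port in hdfs://localhost:9000, and how a full URI for /input/t1.txt would be composed from it. The class name HdfsUriParts and the helper fullUri are made up for this illustration.

```java
import java.net.URI;

class HdfsUriParts {

    // Joins a file system URI such as hdfs://localhost:9000 with an absolute path.
    static URI fullUri(URI fileSystemUri, String absolutePath) {
        return fileSystemUri.resolve(absolutePath);
    }

    public static void main(String[] args) {
        URI fsUri = URI.create("hdfs://localhost:9000");
        System.out.println("scheme = " + fsUri.getScheme()); // hdfs
        System.out.println("host   = " + fsUri.getHost());   // localhost
        System.out.println("port   = " + fsUri.getPort());   // 9000
        // prints hdfs://localhost:9000/input/t1.txt
        System.out.println(fullUri(fsUri, "/input/t1.txt"));
    }
}
```

The scheme (hdfs) is what tells Hadoop which FileSystem implementation to use, while the host and port identify the NameNode to connect to.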

Problems encountered


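1. If only hadoop-2.4.1/share/hadoop/common/hadoop-common-2.4.1.jar and hadoop-2.4.1/share/hadoop/hdfs/hadoop-hdfs-2.4.1.jar are added, Eclipse reports no errors while you write the code, because the code compiles against those two JARs alone. When the program is run from Eclipse, however, it fails because classes those JARs depend on cannot be found, for example: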

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory

2. If log4j is not configured, warnings appear at run time:

log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

So it is best to add a log4j.properties file with some basic settings.
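
For the first problem, a NoClassDefFoundError only shows up when the class is first needed at run time. One way to notice a missing JAR earlier is to probe for a few classes at startup. The sketch below uses Class.forName from the JDK; the class name ClasspathCheck and the list of probed classes are choices made for this illustration (the class from the error above and Hadoop's FileSystem), not part of the original project.

```java
import java.util.ArrayList;
import java.util.List;

class ClasspathCheck {

    // Returns true if the named class can be found on the classpath.
    static boolean isAvailable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the subset of the given class names that cannot be loaded.
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isAvailable(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> absent = missing(
                "org.apache.commons.logging.LogFactory", // the class named in the error above
                "org.apache.hadoop.fs.FileSystem");      // used by both example programs
        if (absent.isEmpty()) {
            System.out.println("All probed classes are on the classpath.");
        } else {
            System.out.println("Missing classes: " + absent);
        }
    }
}
```

Run inside the LearnHDFS project, this reports which of the probed classes the current build path does not provide.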

About HDFS file operations


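File operations include reading and writing files, creating directories, deleting files and directories, and querying file status. They are covered in some detail in Chapter 3, "The Hadoop Distributed Filesystem", of Tom White's "Hadoop: The Definitive Guide" (2nd edition).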

The following two blog posts contain ready-made code:

Hadoop HDFS File Operations: a Java Implementation Class (original title: Hadoop HDFS文件操作 Java实现类)

Querying Hadoop from Tomcat

(End)