Setting Up a MapReduce Development Environment on Windows 10

1 day ago · CN2资讯

MapReduce

What it is

MapReduce (MR) is the distributed computing framework in Hadoop.

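Advantages:

Easy to program:
MR abstracts every computation into two phases, Map (mapping) and Reduce (aggregation).
You only need to extend the Mapper and Reducer classes and implement their methods to get a high-performance distributed program.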
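To make the two phases concrete before any Hadoop class appears, here is a minimal in-memory sketch in plain Java. The class and method names (MapReducePhasesSketch, mapLine, wordCount) are hypothetical and chosen only for illustration; a real job would extend Hadoop's Mapper and Reducer instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustration only: mimics the Map and Reduce phases inside one JVM.
class MapReducePhasesSketch {

    // Map phase: turn one input line into its cleaned, lowercase words.
    static List<String> mapLine(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                .map(w -> w.replaceAll("\\W", ""))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    // Reduce phase: group identical words and count each group.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> mapLine(line).stream())
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // prints {hello=2, mapreduce=1, world=1}
        System.out.println(wordCount(Arrays.asList("Hello world", "hello MapReduce")));
    }
}
```

The point of the sketch is the split of responsibilities: the map step only looks at one record at a time, and the reduce step only looks at one key's group at a time.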

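Scalability
This works much like HDFS: HDFS pools the storage of many machines into a cluster to provide more storage capacity, and MR pools the computing power (CPU, memory) of many machines to process massive data sets.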
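The idea of pooling compute can be sketched on a single machine, with threads standing in for cluster nodes. This is an assumption made purely for illustration: the names ParallelCountSketch, countChunk, and countInParallel are hypothetical, and Hadoop distributes work across machines, not threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: threads play the role of cluster machines.
class ParallelCountSketch {

    // Count the words of one chunk; stands in for the work done on one machine.
    static Map<String, Integer> countChunk(List<String> chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : chunk) {
            for (String w : line.toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Split the input into chunks, count them in parallel, then merge the partial results.
    static Map<String, Integer> countInParallel(List<String> lines, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunkSize = Math.max(1, (lines.size() + workers - 1) / workers);
            List<Future<Map<String, Integer>>> partials = new ArrayList<>();
            for (int start = 0; start < lines.size(); start += chunkSize) {
                List<String> chunk = lines.subList(start, Math.min(lines.size(), start + chunkSize));
                partials.add(pool.submit(() -> countChunk(chunk)));
            }
            Map<String, Integer> total = new HashMap<>();
            for (Future<Map<String, Integer>> partial : partials) {
                try {
                    partial.get().forEach((word, n) -> total.merge(word, n, Integer::sum));
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("chunk failed", e);
                }
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // prints {a=1, b=2, c=3}
        System.out.println(new TreeMap<>(countInParallel(Arrays.asList("a b", "b c", "c c"), 2)));
    }
}
```

Raising the worker count changes nothing in countChunk, which mirrors how the same MR program runs unchanged on a larger cluster.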

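High fault tolerance
When some tasks in a highly concurrent distributed program hit errors, or some machines fail, the MR framework automatically retries the failed work or moves it to another machine, so the job still completes correctly.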
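The retry idea can be sketched in a few lines. This is not Hadoop's implementation, which reschedules task attempts and may place them on other nodes; RetrySketch and runWithRetry are hypothetical names, and the failure is simulated.

```java
import java.util.concurrent.Callable;

// Illustration only: retry a failing task a bounded number of times.
class RetrySketch {

    // Run the task, retrying until it succeeds or maxAttempts is exhausted.
    static <T> T runWithRetry(Callable<T> task, int maxAttempts) {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                System.out.println("attempt " + attempt + " failed: " + e.getMessage());
            }
        }
        throw new IllegalStateException("task failed after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // The task fails twice (simulated), then succeeds on the third attempt.
        String result = runWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("simulated node failure");
            }
            return "done";
        }, 4);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retrying is only safe because the task can be rerun from its input without side effects, which is also why MR tasks read immutable input and write their output only on success.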

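Suited to very large data sets
MR is a poor fit for small data sets. As the data grows, however, MR can process any volume that HDFS can store, using the same application code.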

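Disadvantages:

High latency, so it is not suited to real-time computation.

An MR job reads files that are already stored on disk when it starts. Its input has to be static, so it cannot work on files that are continuously being appended to, which rules out stream processing.

Limited expressiveness. One MR job performs exactly one map step and one aggregation step. A DAG job that needs several aggregations has to be split into several MR jobs, and each of them does heavy disk I/O, which hurts performance.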
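The need to chain jobs can be seen with a computation that aggregates twice: first count each word, then count how many words share each count. The sketch below keeps both stages in memory, and its names (ChainedStagesSketch, countWords, countFrequencies) are hypothetical. In a real MR pipeline each stage would be a separate job, and the intermediate result passed between them would be written to and read back from disk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: two aggregations need two MapReduce-style stages.
class ChainedStagesSketch {

    // Stage 1 (first MR job): word -> number of occurrences.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String w : line.toLowerCase().split("\\s+")) {
                if (!w.isEmpty()) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Stage 2 (second MR job): occurrence count -> number of distinct words with that count.
    static Map<Integer, Integer> countFrequencies(Map<String, Integer> wordCounts) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (int n : wordCounts.values()) {
            histogram.merge(n, 1, Integer::sum);
        }
        return histogram;
    }

    public static void main(String[] args) {
        // Stage 1 yields {a=3, b=2, c=1}; in MR this would be materialized on disk.
        Map<String, Integer> stage1 = countWords(Arrays.asList("a b a", "c b a"));
        // prints {1=1, 2=1, 3=1}
        System.out.println(countFrequencies(stage1));
    }
}
```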

Programming model

1. Counting how often each word appears in a text with plain Java

import org.apache.commons.io.FileUtils;

import java.io.File;
import java.util.*;

public class JavaWordCount {
    public static void main(String[] args) throws Exception {
        // 0. Create a container to hold the results
        HashMap<String, Integer> map = new HashMap<String, Integer>();

        // 1. Read the file
        File file = new File("C:\\projects\\idea\\bigdata2107\\amos\\amos-hadoop\\src\\main\\resources\\Harry.txt");
        String encoding = "utf8";
        List<String> lines = FileUtils.readLines(file, encoding);

        // 2. Iterate over every line
        for (String line : lines) {
            // 3. Split the line into words
            String[] words = line.split("\\s+");
            for (String w : words) {
                // 4. Add 1 to the count each time a word appears.
                //    Two equivalent, more verbose ways to write the update:
                // if (map.containsKey(word)) {
                //     map.put(word, map.get(word) + 1);
                // } else {
                //     map.put(word, 0 + 1);
                // }
                // map.put(word, map.containsKey(word) ? map.get(word) + 1 : 1);
                String word = w.toLowerCase()
                        .replaceAll("\\W", "");
                if (!word.isEmpty()) {
                    map.put(word, map.getOrDefault(word, 0) + 1);
                }
            }
        }

        // 5. Print the result
        System.out.println(map);

        // 6. Sort the result by count, descending
        ArrayList<Map.Entry<String, Integer>> entries = new ArrayList<>(map.entrySet());
        // The same comparator written as an anonymous class:
        // entries.sort(new Comparator<Map.Entry<String, Integer>>() {
        //     @Override
        //     public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
        //         return o2.getValue() - o1.getValue();
        //     }
        // });
        entries.sort((o1, o2) -> o2.getValue() - o1.getValue());

        for (Map.Entry<String, Integer> entry : entries) {
            String word = entry.getKey();
            Integer count = entry.getValue();
            System.out.printf("word: %s count: %d\n", word, count);
        }
    }
}

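2. Using the MapReduce approach

A typical MR program implements three classes:

Mapper

Define a class that extends Mapper and fill in the four generic type parameters for the input and output key/value (kv) pairs.

Mapper has four methods:
setup(context): runs once before the map task starts
map(KEYIN k, VALUEIN v, context): receives one input kv pair per call, processes it, and hands the result to context to be written out
cleanup(context): runs once after the map task finishes
run(context): ties the three methods above together to execute the Mapper's logic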
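The call order of the four methods can be sketched with a stand-in class. LifecycleSketch is hypothetical and drops Hadoop's Context entirely; it only records which method runs when, assuming newline-terminated UTF-8 text so that the byte offsets resemble the keys a text input would produce.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Hadoop's Mapper, reduced to the call order only.
class LifecycleSketch {
    final List<String> log = new ArrayList<>();

    void setup() { log.add("setup"); }

    void map(long offset, String line) { log.add("map@" + offset + ":" + line); }

    void cleanup() { log.add("cleanup"); }

    // run() calls setup once, map once per record, and cleanup once at the end.
    void run(List<String> lines) {
        setup();
        try {
            long offset = 0;
            for (String line : lines) {
                map(offset, line);
                // Advance by the line's byte length plus 1 for the newline.
                offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            }
        } finally {
            cleanup();
        }
    }

    public static void main(String[] args) {
        LifecycleSketch mapper = new LifecycleSketch();
        mapper.run(Arrays.asList("hello world", "foo"));
        // prints [setup, map@0:hello world, map@12:foo, cleanup]
        System.out.println(mapper.log);
    }
}
```

Because setup and cleanup each run once per task, they are the natural place for work that should not repeat per record, such as opening and closing a resource.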

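import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// Mapper has four generic type parameters: KEYIN, VALUEIN, KEYOUT, VALUEOUT,
// the types of the Mapper's input k and v and of its output k and v.
// When reading a text file, the default input K is LongWritable: the position
// at which the current line starts in the file (its byte offset).
// V is Text: the content of the current line.
// The Mapper emits <word, 1>.
// KEYIN = line byte offset, VALUEIN = line content, KEYOUT = word, VALUEOUT = 1
public class Job_WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reused output objects, to avoid allocating a new pair per word
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Split the line that was read into words
        String[] words = value.toString().split("\\s+");

        // 2. Normalize each word: lowercase it and strip special characters
        for (String word : words) {
            String w = word.toLowerCase()
                    .replaceAll("\\W", "");
            // Skip tokens that are empty after cleaning, as JavaWordCount does
            if (w.isEmpty()) {
                continue;
            }
            // 3. Use the word as the output key
            k.set(w);
            // 4. Call context.write() to hand the Map result (<word, 1>)
            //    to the MR framework
            context.write(k, v);
        }
    }
}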

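Reducer

Define a class that extends Reducer and fill in the four generic type parameters for the input and output kv pairs.

Like Mapper, Reducer has four methods.
The reduce(KEYIN k, Iterable<VALUEIN> values, context) method receives one key per call, together with all the values that share that key.
Aggregate the data inside reduce,
then hand the result to context to be written out.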
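How one key ends up with all of its values can be sketched in memory. The grouping step below stands in for what the framework does between map and reduce; ShuffleReduceSketch and its methods are hypothetical names, and a TreeMap is used on the assumption that keys reach the reducer in sorted order.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: group map output by key, then reduce each group.
class ShuffleReduceSketch {

    // Stand-in for reduce(key, values): sum all values that share the key.
    static long reduce(String key, Iterable<Integer> values) {
        long count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    // Stand-in for the framework: collect values per key (sorted), then call reduce per key.
    static Map<String, Long> shuffleAndReduce(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        mapOutput.add(new SimpleEntry<>("hello", 1));
        mapOutput.add(new SimpleEntry<>("world", 1));
        mapOutput.add(new SimpleEntry<>("hello", 1));
        // prints {hello=2, world=1}
        System.out.println(shuffleAndReduce(mapOutput));
    }
}
```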

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

// Like Mapper, Reducer has four generic type parameters.
// The input types are the Mapper's output kv types; the output is <word, count>.
public class Job_WordCountReducer extends Reducer<Text, IntWritable, Text, LongWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Variable that holds the aggregated result
        long count = 0;

        // Iterate over all values that share this key
        for (IntWritable value : values) {
            // Accumulate the count
            count += value.get();
        }

        // Call context.write() to hand the aggregated result to the MR framework
        context.write(key, new LongWritable(count));
    }
}

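Driver
The entry class of the MR job, containing the main method.
main obtains a Job instance and adds the configuration to it,
then submits the job to the cluster.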

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Job_WordCountDriver {
    public static void main(String[] args) throws Exception {
        // 0. Use the conf object for any custom settings the MR job needs
        Configuration conf = new Configuration();

        // 1. Create the Job instance, passing conf so its settings take effect
        Job job = Job.getInstance(conf);

        // 2. Register the driver class with the job
        job.setJarByClass(Job_WordCountDriver.class);

        // 3. Register the mapper class
        job.setMapperClass(Job_WordCountMapper.class);

        // 4. Register the reducer class
        job.setReducerClass(Job_WordCountReducer.class);

        // 5. Set the key type of the Mapper output
        job.setMapOutputKeyClass(Text.class);

        // 6. Set the value type of the Mapper output
        job.setMapOutputValueClass(IntWritable.class);

        // 7. Set the key type of the Reducer output
        job.setOutputKeyClass(Text.class);

        // 8. Set the value type of the Reducer output
        job.setOutputValueClass(LongWritable.class);

        // 9. Set the input path of the MR job
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // 10. Set the output path of the MR job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 11. Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
