Importing a large file into a MySQL table with LOAD DATA INFILE is usually a very time-consuming process, because the statement runs in a single thread and a single transaction, so on current hardware it comes nowhere near saturating the machine. MySQL Shell 8.0.17 introduced a multi-threaded alternative, the util.importTable() utility; since it is implemented in MySQL Shell, you need to install MySQL Shell first. Below we compare how large the performance gap is between the traditional single-threaded LOAD DATA INFILE and the multi-threaded import.
Test environment:
- 4-core CPU
- 8 GB RAM
- SSD disk
- CentOS 7
The file to import, load_data.csv, is about 8 GB.
Single-threaded test:
```sql
mysql> load data infile '/usr/local/mysql/mysql-files/load_data.csv'
    -> into table single.single_load
    -> fields terminated by ',' optionally enclosed by '"'
    -> lines terminated by '\n'
    -> (user_id,visible,name,size,latitude,longitude,timestamp,public,description,inserted);
Query OK, 14244516 rows affected, 0 warnings (39 min 35.5036 sec)
Records: 14244516  Deleted: 0  Skipped: 0  Warnings: 0
```
Total time: 39 min 35.5036 sec.
Multi-threaded test:
```js
mysql-js> util.importTable("/usr/local/mysql/mysql-files/load_data.csv", {
    schema: "parallel", table: "parallel_load",
    columns: ["user_id", "visible", "name", "size", "latitude", "longitude",
              "timestamp", "public", "description", "inserted"],
    dialect: "csv-unix", skipRows: 0, showProgress: true,
    fieldsOptionallyEnclosed: true, fieldsTerminatedBy: ",",
    linesTerminatedBy: "\n", fieldsEnclosedBy: '"',
    threads: 8, bytesPerChunk: "1G", maxRate: "2G"})
Importing from file '/usr/local/mysql/mysql-files/load_data.csv' to table `parallel`.`parallel_load` in MySQL Server at /tmp%2Fmysql.sock using 8 threads
[Worker01] parallel.parallel_load: Records: 1780564  Deleted: 0  Skipped: 0  Warnings: 0
[Worker02] parallel.parallel_load: Records: 1780564  Deleted: 0  Skipped: 0  Warnings: 0
[Worker03] parallel.parallel_load: Records: 1780564  Deleted: 0  Skipped: 0  Warnings: 0
[Worker04] parallel.parallel_load: Records: 1780564  Deleted: 0  Skipped: 0  Warnings: 0
[Worker05] parallel.parallel_load: Records: 1780564  Deleted: 0  Skipped: 0  Warnings: 0
[Worker06] parallel.parallel_load: Records: 1780564  Deleted: 0  Skipped: 0  Warnings: 0
[Worker07] parallel.parallel_load: Records: 1780564  Deleted: 0  Skipped: 0  Warnings: 0
[Worker08] parallel.parallel_load: Records: 1780564  Deleted: 0  Skipped: 0  Warnings: 0
100% (8.06 GB / 8.06 GB), 535.17 KB/s
File '/usr/local/mysql/mysql-files/load_data.csv' (8.06 GB) was imported in 6 min 30.0411 sec at 18.81 MB/s
Total rows affected in parallel.parallel_load: Records: 14244516  Deleted: 0  Skipped: 0  Warnings: 0
```
Total time: 6 min 30.0411 sec.
Result comparison:
The single-threaded import took 39 min 35.5036 sec, while the multi-threaded import took 6 min 30.0411 sec, making the multi-threaded import roughly six times faster.
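The rough 6x figure follows directly from the two wall-clock times reported above; a quick sanity check:

```python
# Wall-clock times from the two tests above, converted to seconds.
single_s = 39 * 60 + 35.5036   # single-threaded LOAD DATA INFILE
parallel_s = 6 * 60 + 30.0411  # util.importTable() with 8 threads

speedup = single_s / parallel_s
print(f"{speedup:.2f}x")  # 6.09x
```

Note the speedup is below the thread count of 8; disk and server-side work keep it from scaling perfectly.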
Key parameters of the multi-threaded import:
- threads: number of parallel worker threads
- bytesPerChunk: size of each chunk the input file is split into
- maxRate: upper limit on the bytes of data each thread may process per second
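As a rough illustration of how bytesPerChunk relates to the worker output above (this is back-of-the-envelope math, not MySQL Shell's actual code, and it assumes the "1G" suffix means a 1024-based gigabyte):

```python
import math

def parse_size(s: str) -> int:
    """Parse a size string like '1G' or '500M'.
    Assumes MySQL-style 1024-based multipliers (an assumption here)."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    suffix = s[-1].upper()
    if suffix in units:
        return int(float(s[:-1]) * units[suffix])
    return int(s)

file_size = 8_060_000_000            # 8.06 GB, as reported for load_data.csv
bytes_per_chunk = parse_size("1G")   # the bytesPerChunk option from the test

# Chunks are cut at row boundaries, so this is only an estimate.
chunks = math.ceil(file_size / bytes_per_chunk)
print(chunks)  # 8
```

With roughly one 1 GB chunk per thread, all 8 workers stay busy, which matches the near-equal record counts each worker reported.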
How the multi-threaded import works internally:
- Analyze the data file to be imported
- Split the large data file into chunks
- Import the chunks into MySQL using multiple threads
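The three steps above can be sketched as a toy model. This is a simplified illustration in plain Python with a stub loader, not MySQL Shell's implementation; a real worker would stream each chunk to the server instead of counting rows:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(data: bytes, bytes_per_chunk: int) -> list:
    """Steps 1-2: scan the file and cut it into chunks, extending each
    chunk to the next newline so no row is split across two chunks."""
    chunks, start = [], 0
    while start < len(data):
        end = min(start + bytes_per_chunk, len(data))
        nl = data.find(b"\n", end - 1)
        end = len(data) if nl == -1 else nl + 1
        chunks.append(data[start:end])
        start = end
    return chunks

def load_chunk(chunk: bytes) -> int:
    """Step 3 stand-in: a real worker would send this chunk to the
    server; here we just count the rows it contains."""
    return chunk.count(b"\n")

# A small fake CSV file: 1000 rows, newline-terminated.
data = b"\n".join(b"row%d,x" % i for i in range(1000)) + b"\n"
chunks = split_into_chunks(data, bytes_per_chunk=256)

# Step 3: process the chunks concurrently with a pool of worker threads.
with ThreadPoolExecutor(max_workers=8) as pool:
    total = sum(pool.map(load_chunk, chunks))

print(total)  # 1000
```

Splitting at row boundaries is the key detail: each chunk is independently loadable, so the workers never need to coordinate mid-row.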