How to update 10 million+ rows in a single MySQL table as fast as possible?
Using MySQL 5.6 with the InnoDB storage engine for most of the tables. The InnoDB buffer pool size is 15 GB, and the InnoDB data + indexes are around 10 GB. The server has 32 GB RAM and is running CentOS 7 x64.
I have one big table which contains around 10 million+ records.
I get an updated dump file from a remote server every 24 hours. The file is in CSV format. I don't have control over that format. The file is ~750 MB. I tried inserting the data into a MyISAM table row by row and it took 35 minutes.
I need to take only 3 of the 10-12 values per line from the file and update them in the database.
What's the best way to achieve something like this?
I need to do this daily.
Currently the flow is like this:
- mysqli_begin_transaction
- Read Dump file line by line
- Update each record line by line.
- mysqli_commit
The above operation takes around 30-40 minutes to complete, and while it is running, other updates are going on, which gives me
Lock wait timeout exceeded; try restarting transaction
Update 1
Data loading into a new table using LOAD DATA LOCAL INFILE: in MyISAM it took 38.93 sec, while in InnoDB it took 7 min 5.21 sec. Then I did:
UPDATE table1 t1, table2 t2
SET
t1.field1 = t2.field1,
t1.field2 = t2.field2,
t1.field3 = t2.field3
WHERE t1.field10 = t2.field10
Query OK, 434914 rows affected (22 hours 14 min 47.55 sec)
Update 2
The same update as a JOIN query:
UPDATE table1 a JOIN table2 b
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4
(14 hours 56 min 46.85 sec)
Clarifications from questions in comments:
- About 6% of the rows in the table will be updated by the file, but sometimes it can be as much as 25%.
- There are indexes on the fields being updated. There are 12 indexes on the table, and 8 indexes include the update fields.
- It is not necessary to do the update in one transaction. It can take time but not more than 24 hours. I am looking to get it done in 1 hour without locking the whole table, as later I have to update the sphinx index which is dependent on this table. It does not matter if the steps take a longer duration as long as the database is available for other tasks.
- I could modify the CSV format in a preprocessing step. The only things that matter are a quick update without locking.
- Table 2 is MyISAM. It is the table newly created from the CSV file using LOAD DATA INFILE. The MYI file size is 452 MB. Table 2 is indexed on the field1 column.
- The MYD of the MyISAM table is 663 MB.
Update 3
Here are more details about both tables.
CREATE TABLE `content` (
`hash` char(40) CHARACTER SET ascii NOT NULL DEFAULT '',
`title` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`og_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`keywords` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`files_count` smallint(5) unsigned NOT NULL DEFAULT '0',
`more_files` smallint(5) unsigned NOT NULL DEFAULT '0',
`files` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '0',
`category` smallint(3) unsigned NOT NULL DEFAULT '600',
`size` bigint(19) unsigned NOT NULL DEFAULT '0',
`downloaders` int(11) NOT NULL DEFAULT '0',
`completed` int(11) NOT NULL DEFAULT '0',
`uploaders` int(11) NOT NULL DEFAULT '0',
`creation_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`upload_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`last_updated` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`vote_up` int(11) unsigned NOT NULL DEFAULT '0',
`vote_down` int(11) unsigned NOT NULL DEFAULT '0',
`comments_count` int(11) NOT NULL DEFAULT '0',
`imdb` int(8) unsigned NOT NULL DEFAULT '0',
`video_sample` tinyint(1) NOT NULL DEFAULT '0',
`video_quality` tinyint(2) NOT NULL DEFAULT '0',
`audio_lang` varchar(127) CHARACTER SET ascii NOT NULL DEFAULT '',
`subtitle_lang` varchar(127) CHARACTER SET ascii NOT NULL DEFAULT '',
`verified` tinyint(1) unsigned NOT NULL DEFAULT '0',
`uploader` int(11) unsigned NOT NULL DEFAULT '0',
`anonymous` tinyint(1) NOT NULL DEFAULT '0',
`enabled` tinyint(1) unsigned NOT NULL DEFAULT '0',
`tfile_size` int(11) unsigned NOT NULL DEFAULT '0',
`scrape_source` tinyint(1) unsigned NOT NULL DEFAULT '0',
`record_num` int(11) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`record_num`),
UNIQUE KEY `hash` (`hash`),
KEY `uploaders` (`uploaders`),
KEY `tfile_size` (`tfile_size`),
KEY `enabled_category_upload_date_verified_` (`enabled`,`category`,`upload_date`,`verified`),
KEY `enabled_upload_date_verified_` (`enabled`,`upload_date`,`verified`),
KEY `enabled_category_verified_` (`enabled`,`category`,`verified`),
KEY `enabled_verified_` (`enabled`,`verified`),
KEY `enabled_uploader_` (`enabled`,`uploader`),
KEY `anonymous_uploader_` (`anonymous`,`uploader`),
KEY `enabled_uploaders_upload_date_` (`enabled`,`uploaders`,`upload_date`),
KEY `enabled_verified_category` (`enabled`,`verified`,`category`),
KEY `verified_enabled_category` (`verified`,`enabled`,`category`)
) ENGINE=InnoDB AUTO_INCREMENT=7551163 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci ROW_FORMAT=FIXED
CREATE TABLE `content_csv_dump_temp` (
`hash` char(40) CHARACTER SET ascii NOT NULL DEFAULT '',
`title` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`category_id` int(11) unsigned NOT NULL DEFAULT '0',
`uploaders` int(11) unsigned NOT NULL DEFAULT '0',
`downloaders` int(11) unsigned NOT NULL DEFAULT '0',
`verified` tinyint(1) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
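The temp table is populated from the daily file with LOAD DATA LOCAL INFILE. A rough sketch of what that statement could look like (the file path, field/line terminators, and CSV column order here are assumptions, since the actual file format is not shown; extra CSV fields can be read into user variables and simply discarded):
LOAD DATA LOCAL INFILE '/path/to/daily_dump.csv'
INTO TABLE content_csv_dump_temp
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
-- map the 6 needed fields; any remaining CSV fields go into throwaway variables
(hash, title, category_id, uploaders, downloaders, verified, @skip1, @skip2);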
And here is the update query which updates the content table using data from content_csv_dump_temp:
UPDATE content a JOIN content_csv_dump_temp b
ON a.hash = b.hash
SET
a.uploaders = b.uploaders,
a.downloaders = b.downloaders,
a.verified = b.verified
Update 4
All of the above testing was done on a test machine. I have now run the same tests on the production machine, and the queries are very fast.
mysql> UPDATE content_test a JOIN content_csv_dump_temp b
-> ON a.hash = b.hash
-> SET
-> a.uploaders = b.uploaders,
-> a.downloaders = b.downloaders,
-> a.verified = b.verified;
Query OK, 2673528 rows affected (7 min 50.42 sec)
Rows matched: 7044818 Changed: 2673528 Warnings: 0
I apologize for my mistake. It is better to use a JOIN instead of updating each record individually. Now I am trying to improve things further using the index suggested by rick_james, and will update once benchmarking is done.
Tags: mysql, performance, mysql-5.6, update, myisam
Do you have a composite INDEX(field2, field3, field4) (in any order)? Please show us SHOW CREATE TABLE.
– Rick James
Dec 29 '15 at 6:42
12 and 8 indexes is a serious part of your problem. MyISAM is another serious part. InnoDB or TokuDB perform much better with multiple indexes.
– Rick James
Dec 29 '15 at 6:43
You have two different UPDATEs. Please tell us exactly what the straightforward statement looks like for updating the table from the CSV data. Then we may be able to help you devise a technique that meets your requirements.
– Rick James
Dec 29 '15 at 15:50
@RickJames there is only one UPDATE; please check the updated question, thanks.
– AMB
Dec 30 '15 at 3:08
6 Answers
Based on my experience, I would use LOAD DATA INFILE to import your CSV File.
The LOAD DATA INFILE statement reads rows from a text file into a
table at a very high speed.
Here is an example I found on the internet (Load Data example). I tested this example on my box and it worked fine.
Example Table
CREATE TABLE example (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`Column2` varchar(14) NOT NULL,
`Column3` varchar(14) NOT NULL,
`Column4` varchar(14) NOT NULL,
`Column5` DATE NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB
Example CSV File
# more /tmp/example.csv
Column1,Column2,Column3,Column4,Column5
1,A,Foo,sdsdsd,4/13/2013
2,B,Bar,sdsa,4/12/2013
3,C,Foo,wewqe,3/12/2013
4,D,Bar,asdsad,2/1/2013
5,E,FOObar,wewqe,5/1/2013
Import Statement to be run from MySQL console
LOAD DATA LOCAL INFILE '/tmp/example.csv'
-> INTO TABLE example
-> FIELDS TERMINATED BY ','
-> LINES TERMINATED BY '\n'
-> IGNORE 1 LINES
-> (id, Column3,Column4, @Column5)
-> set
-> Column5 = str_to_date(@Column5, '%m/%d/%Y');
Result
MySQL [testcsv]> select * from example;
+----+---------+---------+---------+------------+
| Id | Column2 | Column3 | Column4 | Column5 |
+----+---------+---------+---------+------------+
| 1 | | Column2 | Column3 | 0000-00-00 |
| 2 | | B | Bar | 0000-00-00 |
| 3 | | C | Foo | 0000-00-00 |
| 4 | | D | Bar | 0000-00-00 |
| 5 | | E | FOObar | 0000-00-00 |
+----+---------+---------+---------+------------+
IGNORE simply skips the first line, which contains the column headers.
After IGNORE, we specify the columns to import (skipping Column2), which matches one of the criteria in your question.
Here is another example directly from Oracle: LOAD DATA INFILE example
This should be enough to get you started.
I could use LOAD DATA for loading the data into a temp table and then use other queries to update it in the main table, thanks.
– AMB
Oct 30 '15 at 13:11
In light of all the things mentioned, it looks like the bottleneck is the join itself.
ASPECT #1 : Join Buffer Size
In all likelihood, your join_buffer_size is too low.
According to the MySQL Documentation on How MySQL Uses the Join Buffer Cache
We only store the used columns in the join buffer, not the whole rows.
This being the case, make the keys of the join buffer stay in RAM.
You have 10 million rows times 4 bytes for each key. That's about 40M.
Try bumping it up in the session to 42M (a little bigger than 40M)
SET join_buffer_size = 1024 * 1024 * 42;
UPDATE table1 a JOIN table2 b
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4;
If this does the trick, proceed to add this to my.cnf
[mysqld]
join_buffer_size = 42M
Restarting mysqld is not required for new connections. Just run:
mysql> SET GLOBAL join_buffer_size = 1024 * 1024 * 42;
ASPECT #2 : Join Operation
You could manipulate the style of the join operation by tweaking the optimizer.
According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins
When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.
For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.
This same page recommends doing this:
mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
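Putting Aspect #1 and Aspect #2 together in one session, with the actual table names from the question, might look like the sketch below (values are the ones suggested above; treat it as an illustration, not a tested recipe):
-- session-scoped settings from Aspects #1 and #2
SET optimizer_switch = 'mrr=on,mrr_cost_based=off,batched_key_access=on';
SET join_buffer_size = 1024 * 1024 * 42;
-- the update from the question, run in the same session
UPDATE content a
JOIN content_csv_dump_temp b ON a.hash = b.hash
SET a.uploaders   = b.uploaders,
    a.downloaders = b.downloaders,
    a.verified    = b.verified;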
ASPECT #3 : Writing Updates to Disk (OPTIONAL)
Most forget to increase the innodb_write_io_threads to write dirty pages out of the buffer pool faster.
[mysqld]
innodb_write_io_threads = 16
You will have to restart MySQL for this change
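To check the value the server is currently running with (a routine check, not something the answer itself prescribes):
SHOW GLOBAL VARIABLES LIKE 'innodb_write_io_threads';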
GIVE IT A TRY !!!
Nice! +1 for the tunable join buffer tip. If you gotta join, join in memory. Good tip!
– Peter Dixon-Moses
Dec 29 '15 at 20:26
1. CREATE TABLE that matches the CSV
2. LOAD DATA into that table
3. UPDATE real_table JOIN csv_table ON ... SET ..., ..., ...;
4. DROP TABLE csv_table;
Step 3 will be a lot faster than row-by-row, but it will still lock all the rows in the table for a non-trivial amount of time. If this lock time is more important than how long the entire process takes, then...
If nothing else is writing to the table, then...
1. CREATE TABLE that matches the CSV; no indexes except what is needed in the JOIN in the UPDATE. If unique, make it PRIMARY KEY.
2. LOAD DATA into that table
3. Copy the real_table to new_table (CREATE ... SELECT)
4. UPDATE new_table JOIN csv_table ON ... SET ..., ..., ...;
5. RENAME TABLE real_table TO old, new_table TO real_table;
6. DROP TABLE csv_table, old;
Step 3 is faster than the update, especially if unnecessary indexes are left off.
Step 5 is "instantaneous".
Let's say, for example, that after step 3 we are doing step 4, and new data gets inserted into the real_table; we will miss that data in the new_table. What is the workaround for this? Thanks
– AMB
Dec 25 '15 at 3:14
See pt-online-schema-change; it takes care of such issues via a TRIGGER.
– Rick James
Dec 25 '15 at 8:39
You probably do not need any indexes on the table from LOAD DATA. Adding unnecessary indexes is costly (in time).
– Rick James
Dec 29 '15 at 6:40
Based on the latest info, I am leaning toward the CSV file being loaded into a MyISAM table with just an AUTO_INCREMENT, then chunking 1K rows at a time based on the PK. But I need to see all the requirements and the table schema before trying to spell out the details.
– Rick James
Dec 29 '15 at 15:53
I have set hash as the PRIMARY index, but chunking in 50k batches using an ORDER BY query takes more time. Would it be better if I create an auto-increment column and set it as the PRIMARY index?
– AMB
Dec 30 '15 at 3:11
You've said:
- Updates affect 6-25% of your table
- You want to do this as fast as possible (<1hr)
- without locking
- it doesn't have to be in a single transaction
- yet (in a comment on Rick James' answer), you express concern about race conditions
Many of these statements can be contradictory. For example, large updates w/o locking the table. Or avoiding race conditions w/o using one giant transaction.
Also, since your table is heavily indexed, both inserts and updates can be slow.
Avoiding race conditions
If you're able to add an updated time stamp to your table, you can solve for race conditions while also avoiding logging a half-million updates in a single transaction.
This frees you to perform line-by-line updates (as you currently do), but with autocommit or more reasonable transaction batches.
You avoid race conditions (while updating line-by-line) by performing a check that a later update has not already occurred (UPDATE ... WHERE pk = [pk] AND updated < [batchfile date]).
And, importantly, this lets you run parallel updates.
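A minimal sketch of that guard, assuming an updated timestamp column is added to content and the batch file's export time is known (both are assumptions, not part of the original schema; the literal values stand in for data from one CSV line):
-- one-time schema change (assumption: no such column exists yet)
ALTER TABLE content
  ADD COLUMN updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
-- one guarded, autocommitted statement per CSV row;
-- '2015-12-29 00:00:00' stands in for the batch file's export time
UPDATE content
SET uploaders = 1200, downloaders = 4500, verified = 1,
    updated   = '2015-12-29 00:00:00'
WHERE hash = '0123456789abcdef0123456789abcdef01234567'  -- hash from the CSV line
  AND updated < '2015-12-29 00:00:00';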
Running as fast as possible: Parallelizing
With this time stamp check now in place:
- Split your batch file into some reasonably sized chunks (say 50,000 rows/file)
- In parallel, have a script read in each file and output a file with 50,000 UPDATE statements.
- In parallel, once (2) finishes, have mysql run each sql file.
(e.g. in bash, look at split and xargs -P for ways to easily run a command in parallel. The degree of parallelism depends upon how many threads you're willing to dedicate to the update.)
Keep in mind that "line-by-line" is likely to be 10x slower than doing things in batches of at least 100.
– Rick James
Dec 29 '15 at 6:37
You'd have to benchmark it in this case to be sure. Updating 6-25% of a table (with 8 indexes involved with the updated columns), I'd entertain the possibility that index maintenance becomes the bottleneck.
– Peter Dixon-Moses
Dec 29 '15 at 11:37
I mean, in some cases it might be faster to drop indices, bulk update, and recreate them after... but OP doesn't want downtime.
– Peter Dixon-Moses
Dec 29 '15 at 11:58
For the UPDATE to run fast, you need
INDEX(uploaders, downloaders, verified)
It can be on either table. The three fields can be in any order.
That will facilitate the UPDATE being able to quickly match up the rows between the two tables.
And make the datatypes the same in the two tables (both INT SIGNED or both INT UNSIGNED).
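For reference, adding the suggested composite index to the staging table could be done like this (the index name is made up):
ALTER TABLE content_csv_dump_temp
  ADD INDEX idx_upl_dl_ver (uploaders, downloaders, verified);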
this actually slowed down the update.
– AMB
Jan 1 '16 at 3:34
Hmmm... Please provide EXPLAIN UPDATE ...;
– Rick James
Jan 1 '16 at 4:59
Large updates are I/O bound. I would suggest:
- Create a distinct table that will store your 3 frequently updated fields. Let's call one table assets_static where you keep, well, static data, and the other assets_dynamic that will store uploaders, downloaders and verified.
- If you can, use the MEMORY engine for the assets_dynamic table. (backup to disk after each update).
- Update your lightweight and nimble assets_dynamic as per your update 4 (i.e. LOAD DATA INFILE ... INTO temp; UPDATE assets_dynamic a JOIN temp b ON a.id = b.id SET [what has to be updated]); see the sketch after this list. This should take less than a minute. (On our system, assets_dynamic has 95M rows and updates impact ~6M rows, in a little more than 40 seconds.)
- When you run Sphinx's indexer, JOIN assets_static and assets_dynamic (assuming that you want to use one of these fields as an attribute).
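A rough sketch of the split for the tables in this question (column choices follow the answer; the exact DDL is an assumption, and note that MEMORY tables live only in RAM, hence the advice to back up to disk after each update):
-- the three frequently updated fields move to a small in-memory table;
-- everything else stays in the existing InnoDB table (assets_static)
CREATE TABLE assets_dynamic (
  hash        char(40) CHARACTER SET ascii NOT NULL,
  uploaders   int unsigned NOT NULL DEFAULT 0,
  downloaders int unsigned NOT NULL DEFAULT 0,
  verified    tinyint unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (hash)
) ENGINE=MEMORY;
-- daily refresh, as in update 4 of the question
UPDATE assets_dynamic a
JOIN content_csv_dump_temp b ON a.hash = b.hash
SET a.uploaders   = b.uploaders,
    a.downloaders = b.downloaders,
    a.verified    = b.verified;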
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f119621%2fhow-to-update-10-million-rows-in-mysql-single-table-as-fast-as-possible%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
6 Answers
6
active
oldest
votes
6 Answers
6
active
oldest
votes
active
oldest
votes
active
oldest
votes
Based on my experience, I would use LOAD DATA INFILE to import your CSV File.
The LOAD DATA INFILE statement reads rows from a text file into a
table at a very high speed.
Example I found on the internet Load Data example. I tested this example on my box and worked fine
Example Table
CREATE TABLE example (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`Column2` varchar(14) NOT NULL,
`Column3` varchar(14) NOT NULL,
`Column4` varchar(14) NOT NULL,
`Column5` DATE NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB
Example CSV File
# more /tmp/example.csv
Column1,Column2,Column3,Column4,Column5
1,A,Foo,sdsdsd,4/13/2013
2,B,Bar,sdsa,4/12/2013
3,C,Foo,wewqe,3/12/2013
4,D,Bar,asdsad,2/1/2013
5,E,FOObar,wewqe,5/1/2013
Import Statement to be run from MySQL console
LOAD DATA LOCAL INFILE '/tmp/example.csv'
-> INTO TABLE example
-> FIELDS TERMINATED BY ','
-> LINES TERMINATED BY 'n'
-> IGNORE 1 LINES
-> (id, Column3,Column4, @Column5)
-> set
-> Column5 = str_to_date(@Column5, '%m/%d/%Y');
Result
MySQL [testcsv]> select * from example;
+----+---------+---------+---------+------------+
| Id | Column2 | Column3 | Column4 | Column5 |
+----+---------+---------+---------+------------+
| 1 | | Column2 | Column3 | 0000-00-00 |
| 2 | | B | Bar | 0000-00-00 |
| 3 | | C | Foo | 0000-00-00 |
| 4 | | D | Bar | 0000-00-00 |
| 5 | | E | FOObar | 0000-00-00 |
+----+---------+---------+---------+------------+
IGNORE just simply ignores the first line which are column headers.
After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.
Here is another example directly from Oracle : LOAD DATA INFILE example
This should be enough to get you started.
i could use load data for loading data in temp table and then use other queries to update it in main table., thanks
– AMB
Oct 30 '15 at 13:11
add a comment |
Based on my experience, I would use LOAD DATA INFILE to import your CSV File.
The LOAD DATA INFILE statement reads rows from a text file into a
table at a very high speed.
Example I found on the internet Load Data example. I tested this example on my box and worked fine
Example Table
CREATE TABLE example (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`Column2` varchar(14) NOT NULL,
`Column3` varchar(14) NOT NULL,
`Column4` varchar(14) NOT NULL,
`Column5` DATE NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB
Example CSV File
# more /tmp/example.csv
Column1,Column2,Column3,Column4,Column5
1,A,Foo,sdsdsd,4/13/2013
2,B,Bar,sdsa,4/12/2013
3,C,Foo,wewqe,3/12/2013
4,D,Bar,asdsad,2/1/2013
5,E,FOObar,wewqe,5/1/2013
Import Statement to be run from MySQL console
LOAD DATA LOCAL INFILE '/tmp/example.csv'
-> INTO TABLE example
-> FIELDS TERMINATED BY ','
-> LINES TERMINATED BY 'n'
-> IGNORE 1 LINES
-> (id, Column3,Column4, @Column5)
-> set
-> Column5 = str_to_date(@Column5, '%m/%d/%Y');
Result
MySQL [testcsv]> select * from example;
+----+---------+---------+---------+------------+
| Id | Column2 | Column3 | Column4 | Column5 |
+----+---------+---------+---------+------------+
| 1 | | Column2 | Column3 | 0000-00-00 |
| 2 | | B | Bar | 0000-00-00 |
| 3 | | C | Foo | 0000-00-00 |
| 4 | | D | Bar | 0000-00-00 |
| 5 | | E | FOObar | 0000-00-00 |
+----+---------+---------+---------+------------+
IGNORE just simply ignores the first line which are column headers.
After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.
Here is another example directly from Oracle : LOAD DATA INFILE example
This should be enough to get you started.
i could use load data for loading data in temp table and then use other queries to update it in main table., thanks
– AMB
Oct 30 '15 at 13:11
add a comment |
Based on my experience, I would use LOAD DATA INFILE to import your CSV File.
The LOAD DATA INFILE statement reads rows from a text file into a
table at a very high speed.
Example I found on the internet Load Data example. I tested this example on my box and worked fine
Example Table
CREATE TABLE example (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`Column2` varchar(14) NOT NULL,
`Column3` varchar(14) NOT NULL,
`Column4` varchar(14) NOT NULL,
`Column5` DATE NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB
Example CSV File
# more /tmp/example.csv
Column1,Column2,Column3,Column4,Column5
1,A,Foo,sdsdsd,4/13/2013
2,B,Bar,sdsa,4/12/2013
3,C,Foo,wewqe,3/12/2013
4,D,Bar,asdsad,2/1/2013
5,E,FOObar,wewqe,5/1/2013
Import Statement to be run from MySQL console
LOAD DATA LOCAL INFILE '/tmp/example.csv'
-> INTO TABLE example
-> FIELDS TERMINATED BY ','
-> LINES TERMINATED BY 'n'
-> IGNORE 1 LINES
-> (id, Column3,Column4, @Column5)
-> set
-> Column5 = str_to_date(@Column5, '%m/%d/%Y');
Result
MySQL [testcsv]> select * from example;
+----+---------+---------+---------+------------+
| Id | Column2 | Column3 | Column4 | Column5 |
+----+---------+---------+---------+------------+
| 1 | | Column2 | Column3 | 0000-00-00 |
| 2 | | B | Bar | 0000-00-00 |
| 3 | | C | Foo | 0000-00-00 |
| 4 | | D | Bar | 0000-00-00 |
| 5 | | E | FOObar | 0000-00-00 |
+----+---------+---------+---------+------------+
IGNORE just simply ignores the first line which are column headers.
After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.
Here is another example directly from Oracle : LOAD DATA INFILE example
This should be enough to get you started.
Based on my experience, I would use LOAD DATA INFILE to import your CSV File.
The LOAD DATA INFILE statement reads rows from a text file into a
table at a very high speed.
Example I found on the internet Load Data example. I tested this example on my box and worked fine
Example Table
CREATE TABLE example (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`Column2` varchar(14) NOT NULL,
`Column3` varchar(14) NOT NULL,
`Column4` varchar(14) NOT NULL,
`Column5` DATE NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB
Example CSV File
# more /tmp/example.csv
Column1,Column2,Column3,Column4,Column5
1,A,Foo,sdsdsd,4/13/2013
2,B,Bar,sdsa,4/12/2013
3,C,Foo,wewqe,3/12/2013
4,D,Bar,asdsad,2/1/2013
5,E,FOObar,wewqe,5/1/2013
Import Statement to be run from MySQL console
LOAD DATA LOCAL INFILE '/tmp/example.csv'
-> INTO TABLE example
-> FIELDS TERMINATED BY ','
-> LINES TERMINATED BY 'n'
-> IGNORE 1 LINES
-> (id, Column3,Column4, @Column5)
-> set
-> Column5 = str_to_date(@Column5, '%m/%d/%Y');
Result
MySQL [testcsv]> select * from example;
+----+---------+---------+---------+------------+
| Id | Column2 | Column3 | Column4 | Column5 |
+----+---------+---------+---------+------------+
| 1 | | Column2 | Column3 | 0000-00-00 |
| 2 | | B | Bar | 0000-00-00 |
| 3 | | C | Foo | 0000-00-00 |
| 4 | | D | Bar | 0000-00-00 |
| 5 | | E | FOObar | 0000-00-00 |
+----+---------+---------+---------+------------+
IGNORE just simply ignores the first line which are column headers.
After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.
Here is another example directly from Oracle : LOAD DATA INFILE example
This should be enough to get you started.
edited Oct 30 '15 at 12:59
answered Oct 30 '15 at 12:45
Craig EfreinCraig Efrein
7,19784078
7,19784078
i could use load data for loading data in temp table and then use other queries to update it in main table., thanks
– AMB
Oct 30 '15 at 13:11
add a comment |
i could use load data for loading data in temp table and then use other queries to update it in main table., thanks
– AMB
Oct 30 '15 at 13:11
i could use load data for loading data in temp table and then use other queries to update it in main table., thanks
– AMB
Oct 30 '15 at 13:11
i could use load data for loading data in temp table and then use other queries to update it in main table., thanks
– AMB
Oct 30 '15 at 13:11
add a comment |
In light of all the things mentioned, it looks like the bottleneck is the join itself.
ASPECT #1 : Join Buffer Size
In all likelihood, your join_buffer_size is probably too low.
According to the MySQL Documentation on How MySQL Uses the Join Buffer Cache
We only store the used columns in the join buffer, not the whole rows.
This being the case, make the keys of the join buffer stay in RAM.
You have 10 million rows times 4 bytes for each key. That's about 40M.
Try bumping it up in the session to 42M (a little bigger than 40M)
SET join_buffer_size = 1024 * 1024 * 42;
UPDATE table1 a JOIN table2 b
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4;
If this does the trick, proceed to add this to my.cnf
[mysqld]
join_buffer_size = 42M
Restarting mysqld not required for new connections. Just run
mysql> SET GLOBAL join_buffer_size = 1024 * 1024 * 42;
ASPECT #2 : Join Operation
You could manipulate the style of the join operation by tweeking the optimizer
According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins
When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.
For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.
This same page recommends doing this:
mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
ASPECT #3 : Writing Updates to Disk (OPTIONAL)
Most forget to increase the innodb_write_io_threads to write dirty pages out of the buffer pool faster.
[mysqld]
innodb_write_io_threads = 16
You will have to restart MySQL for this change
GIVE IT A TRY !!!
Nice! +1 for the tunable join buffer tip. If you gotta join, join in memory. Good tip!
– Peter Dixon-Moses
Dec 29 '15 at 20:26
add a comment |
In light of all the things mentioned, it looks like the bottleneck is the join itself.
ASPECT #1 : Join Buffer Size
In all likelihood, your join_buffer_size is probably too low.
According to the MySQL Documentation on How MySQL Uses the Join Buffer Cache
We only store the used columns in the join buffer, not the whole rows.
This being the case, make the keys of the join buffer stay in RAM.
You have 10 million rows times 4 bytes for each key. That's about 40M.
Try bumping it up in the session to 42M (a little bigger than 40M)
SET join_buffer_size = 1024 * 1024 * 42;
UPDATE table1 a JOIN table2 b
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4;
If this does the trick, proceed to add this to my.cnf
[mysqld]
join_buffer_size = 42M
Restarting mysqld not required for new connections. Just run
mysql> SET GLOBAL join_buffer_size = 1024 * 1024 * 42;
ASPECT #2 : Join Operation
You could manipulate the style of the join operation by tweeking the optimizer
According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins
When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.
For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.
This same page recommends doing this:
mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
ASPECT #3 : Writing Updates to Disk (OPTIONAL)
Most forget to increase the innodb_write_io_threads to write dirty pages out of the buffer pool faster.
[mysqld]
innodb_write_io_threads = 16
You will have to restart MySQL for this change
GIVE IT A TRY !!!
Nice! +1 for the tunable join buffer tip. If you gotta join, join in memory. Good tip!
– Peter Dixon-Moses
Dec 29 '15 at 20:26
add a comment |
In light of all the things mentioned, it looks like the bottleneck is the join itself.
ASPECT #1 : Join Buffer Size
In all likelihood, your join_buffer_size is probably too low.
According to the MySQL Documentation on How MySQL Uses the Join Buffer Cache
We only store the used columns in the join buffer, not the whole rows.
This being the case, make the keys of the join buffer stay in RAM.
You have 10 million rows times 4 bytes for each key. That's about 40M.
Try bumping it up in the session to 42M (a little bigger than 40M)
SET join_buffer_size = 1024 * 1024 * 42;
UPDATE table1 a JOIN table2 b
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4;
If this does the trick, proceed to add this to my.cnf
[mysqld]
join_buffer_size = 42M
Restarting mysqld not required for new connections. Just run
mysql> SET GLOBAL join_buffer_size = 1024 * 1024 * 42;
ASPECT #2 : Join Operation
You could manipulate the style of the join operation by tweeking the optimizer
According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins
When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.
For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.
This same page recommends doing this:
mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
ASPECT #3 : Writing Updates to Disk (OPTIONAL)
Most forget to increase the innodb_write_io_threads to write dirty pages out of the buffer pool faster.
[mysqld]
innodb_write_io_threads = 16
You will have to restart MySQL for this change
GIVE IT A TRY !!!
In light of all the things mentioned, it looks like the bottleneck is the join itself.
ASPECT #1 : Join Buffer Size
In all likelihood, your join_buffer_size is probably too low.
According to the MySQL Documentation on How MySQL Uses the Join Buffer Cache
We only store the used columns in the join buffer, not the whole rows.
This being the case, make the keys of the join buffer stay in RAM.
You have 10 million rows times 4 bytes for each key. That's about 40M.
Try bumping it up in the session to 42M (a little bigger than 40M)
SET join_buffer_size = 1024 * 1024 * 42;
UPDATE table1 a JOIN table2 b
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4;
If this does the trick, proceed to add this to my.cnf
[mysqld]
join_buffer_size = 42M
Restarting mysqld not required for new connections. Just run
mysql> SET GLOBAL join_buffer_size = 1024 * 1024 * 42;
ASPECT #2 : Join Operation
You could manipulate the style of the join operation by tweeking the optimizer
According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins
When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.
For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.
This same page recommends doing this:
mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';
ASPECT #3 : Writing Updates to Disk (OPTIONAL)
Most forget to increase the innodb_write_io_threads to write dirty pages out of the buffer pool faster.
[mysqld]
innodb_write_io_threads = 16
You will have to restart MySQL for this change
GIVE IT A TRY !!!
answered Dec 28 '15 at 21:56
RolandoMySQLDBARolandoMySQLDBA
141k24219374
141k24219374
Nice! +1 for the tunable join buffer tip. If you gotta join, join in memory. Good tip!
– Peter Dixon-Moses
Dec 29 '15 at 20:26
1. CREATE TABLE that matches the CSV
2. LOAD DATA into that table
3. UPDATE real_table JOIN csv_table ON ... SET ..., ..., ...;
4. DROP TABLE csv_table;
Step 3 will be a lot faster than row-by-row, but it will still lock all the rows in the table for a non-trivial amount of time. If this lock time is more important than how long the entire process takes, then...
If nothing else is writing to the table, then...
1. CREATE TABLE that matches the CSV; no indexes except what is needed in the JOIN in the UPDATE. If unique, make it the PRIMARY KEY.
2. LOAD DATA into that table
3. Copy the real_table to new_table (CREATE ... SELECT)
4. UPDATE new_table JOIN csv_table ON ... SET ..., ..., ...;
5. RENAME TABLE real_table TO old, new_table TO real_table;
6. DROP TABLE csv_table, old;
Step 3 is faster than the update, especially if unnecessary indexes are left off.
Step 5 is "instantaneous".
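Put together, the no-downtime variant might look roughly like the sketch below. It is only a sketch: the names (csv_table, new_table, real_table, hash, uploaders, downloaders, verified) and the file path are assumptions pulled from elsewhere in the thread, and the CSV is assumed to have already been preprocessed down to the four columns being loaded.
-- 1-2. Staging table shaped like the preprocessed CSV, keyed only on the join column.
CREATE TABLE csv_table (
  hash        CHAR(40) CHARACTER SET ascii NOT NULL,
  uploaders   INT NOT NULL,
  downloaders INT NOT NULL,
  verified    INT NOT NULL,
  PRIMARY KEY (hash)
) ENGINE=MyISAM;

LOAD DATA LOCAL INFILE '/path/to/dump.csv'
INTO TABLE csv_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(hash, uploaders, downloaders, verified);

-- 3. Copy the live table. CREATE TABLE ... LIKE keeps the schema and indexes;
--    the CREATE ... SELECT form would need the indexes re-added before step 5.
CREATE TABLE new_table LIKE real_table;
INSERT INTO new_table SELECT * FROM real_table;

-- 4. One bulk update against the copy; real_table stays untouched and readable.
UPDATE new_table n
JOIN csv_table c ON n.hash = c.hash
SET n.uploaders   = c.uploaders,
    n.downloaders = c.downloaders,
    n.verified    = c.verified;

-- 5-6. Atomic swap, then clean up.
RENAME TABLE real_table TO old, new_table TO real_table;
DROP TABLE csv_table, old;
RENAME TABLE swaps both names in a single atomic statement, which is why step 5 appears "instantaneous" to readers of the table.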
answered Dec 6 '15 at 7:49
Rick James
Let's say that after step 3, while we are doing step 4, new data gets inserted into real_table; we will miss that data in new_table. What is the workaround for this? Thanks.
– AMB
Dec 25 '15 at 3:14
See what pt-online-schema-digest does; it takes care of such issues via a TRIGGER.
– Rick James
Dec 25 '15 at 8:39
You probably do not need any indexes on the table from LOAD DATA. Adding unnecessary indexes is costly (in time).
– Rick James
Dec 29 '15 at 6:40
Based on the latest info, I am leaning toward the CSV file being loaded into a MyISAM table with just an AUTO_INCREMENT, then chunking 1K rows at a time based on the PK. But I need to see all the requirements and the table schema before trying to spell out the details.
– Rick James
Dec 29 '15 at 15:53
I have set hash as the PRIMARY KEY, but chunking in 50K batches using an ORDER BY query takes more time. Would it be better to create an AUTO_INCREMENT column and set it as the PRIMARY KEY?
– AMB
Dec 30 '15 at 3:11
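For reference, the chunking idea Rick James sketches above could look roughly like this. It is a hedged sketch only: the staging table, the column names and the 1,000-row slice are illustrative assumptions, not something spelled out in the thread.
-- Staging table keyed by an AUTO_INCREMENT so it can be walked in PK ranges.
CREATE TABLE csv_table (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  hash        CHAR(40) CHARACTER SET ascii NOT NULL,
  uploaders   INT NOT NULL,
  downloaders INT NOT NULL,
  verified    INT NOT NULL,
  PRIMARY KEY (id)
) ENGINE=MyISAM;

-- Update in 1K-row slices, each its own short statement/transaction, so locks
-- on the big table stay brief. Relies on real_table having an index (here,
-- the PRIMARY KEY) on hash.
UPDATE real_table t
JOIN csv_table c ON t.hash = c.hash
SET t.uploaders   = c.uploaders,
    t.downloaders = c.downloaders,
    t.verified    = c.verified
WHERE c.id BETWEEN 1 AND 1000;   -- next pass: 1001-2000, and so on
A small driver script would simply loop the WHERE range from 1 up to MAX(id) in the staging table.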
You've said:
- Updates affect 6-25% of your table
- You want to do this as fast as possible (<1hr)
- without locking
- it doesn't have to be in a single transaction
- yet (in a comment on Rick James's answer) you express concern about race conditions
Many of these statements can be contradictory. For example, large updates w/o locking the table. Or avoiding race conditions w/o using one giant transaction.
Also, since your table is heavily indexed, both inserts and updates can be slow.
Avoiding race conditions
If you're able to add an updated time stamp to your table, you can solve for race conditions while also avoiding logging a half-million updates in a single transaction.
This frees you to perform line-by-line updates (as you currently do), but with autocommit or more reasonable transaction batches.
You avoid race conditions (while updating line-by-line) by performing a check that a later update has not already occurred (UPDATE ... WHERE pk = [pk] AND updated < [batchfile date]).
And, importantly, this lets you run parallel updates.
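A minimal sketch of that guard, assuming a hypothetical updated timestamp column (the one this answer proposes adding) and a hash primary key; the literal values stand in for one line of the CSV:
-- @batch_ts is the dump file's timestamp; each per-row update refuses to
-- overwrite a row that something else has already touched more recently.
SET @batch_ts = '2015-12-29 00:00:00';

UPDATE real_table
SET uploaders = 123, downloaders = 456, verified = 1,
    updated = @batch_ts
WHERE hash = 'abc123'            -- the key from this CSV line
  AND updated < @batch_ts;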
Running as fast as possible — Parallelizing
With this time stamp check now in place:
- Split your batch file into some reasonably sized chunks (say 50,000 rows/file)
- In parallel, have a script read in each file and output a file with 50,000 UPDATE statements.
- In parallel, once (2) finishes, have mysql run each SQL file.
(e.g. in bash, look at split and xargs -P for easy ways to run a command in parallel. The degree of parallelism depends upon how many threads you're willing to dedicate to the update.)
answered Dec 28 '15 at 21:07
Peter Dixon-Moses
Keep in mind that "line-by-line" is likely to be 10x slower than doing things in batches of at least 100.
– Rick James
Dec 29 '15 at 6:37
You'd have to benchmark it in this case to be sure. When updating 6-25% of a table (with 8 indexes involving the updated columns), I'd entertain the possibility that index maintenance becomes the bottleneck.
– Peter Dixon-Moses
Dec 29 '15 at 11:37
I mean, in some cases it might be faster to drop indices, bulk update, and recreate them after... but OP doesn't want downtime.
– Peter Dixon-Moses
Dec 29 '15 at 11:58
For the UPDATE to run fast, you need
INDEX(uploaders, downloaders, verified)
It can be on either table. The three fields can be in any order.
That will let the UPDATE quickly match up the rows between the two tables.
Also make the datatypes the same in the two tables (both INT SIGNED or both INT UNSIGNED).
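As an illustration only (the staging table name csv_table and the column types are assumptions; the answer does not spell them out), the suggestion amounts to something like:
-- Composite index on the staging table loaded from the CSV ...
ALTER TABLE csv_table
  ADD INDEX idx_match (uploaders, downloaders, verified);

-- ... and column types aligned with the live table, as the answer recommends
-- (identical signedness and width on both sides of the comparison).
ALTER TABLE csv_table
  MODIFY uploaders   INT UNSIGNED NOT NULL,
  MODIFY downloaders INT UNSIGNED NOT NULL,
  MODIFY verified    INT UNSIGNED NOT NULL;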
answered Dec 30 '15 at 16:03
Rick James
This actually slowed down the update.
– AMB
Jan 1 '16 at 3:34
Hmmm... Please provide EXPLAIN UPDATE ...;
– Rick James
Jan 1 '16 at 4:59
Large updates are I/O bound. I would suggest:
- Create a distinct table that will store your 3 frequently updated fields. Let's call one table assets_static, where you keep, well, static data, and the other assets_dynamic, which will store uploaders, downloaders and verified (see the sketch after this list).
- If you can, use the MEMORY engine for the assets_dynamic table (back up to disk after each update).
- Update your lightweight and nimble assets_dynamic as per your update 4 (i.e. LOAD DATA INFILE ... INTO TABLE temp; UPDATE assets_dynamic a JOIN temp b ON a.id=b.id SET [what has to be updated]). This should take less than a minute. (On our system, assets_dynamic has 95M rows and updates impact ~6M rows, in a little more than 40s.)
- When you run Sphinx's indexer, JOIN assets_static and assets_dynamic (assuming that you want to use one of these fields as an attribute).
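A minimal sketch of that split, assuming the two halves share an integer id primary key as in the JOIN above (names and types are illustrative, not the poster's actual DDL):
-- Rarely-changing columns stay in one table ...
CREATE TABLE assets_static (
  id INT UNSIGNED NOT NULL,
  -- ... the other static columns ...
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- ... and the three frequently updated counters live in a small, separate table.
CREATE TABLE assets_dynamic (
  id          INT UNSIGNED NOT NULL,
  uploaders   INT NOT NULL DEFAULT 0,
  downloaders INT NOT NULL DEFAULT 0,
  verified    INT NOT NULL DEFAULT 0,
  PRIMARY KEY (id)
) ENGINE=MEMORY;   -- kept in RAM; persist it to disk after each batch

-- Daily refresh: load the CSV into temp, then one bulk update of the small table.
UPDATE assets_dynamic a
JOIN temp b ON a.id = b.id
SET a.uploaders   = b.uploaders,
    a.downloaders = b.downloaders,
    a.verified    = b.verified;

-- The Sphinx indexer's source query joins the two halves back together.
SELECT s.id, d.uploaders, d.downloaders, d.verified
FROM assets_static s
JOIN assets_dynamic d ON d.id = s.id;
One caveat worth keeping in mind: MEMORY tables are emptied on a server restart, hence the advice to back the table up to disk after each update.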
answered 5 mins ago
user3127882
Do you have a composite INDEX(field2, field3, field4) (in any order)? Please show us SHOW CREATE TABLE.
– Rick James
Dec 29 '15 at 6:42
12 and 8 indexes is a serious part of your problem. MyISAM is another serious part. InnoDB or TokuDB perform much better with multiple indexes.
– Rick James
Dec 29 '15 at 6:43
You have two different UPDATEs. Please tell us exactly what the straightforward statement looks like for updating the table from the csv data. Then we may be able to help you devise a technique that meets your requirements.
– Rick James
Dec 29 '15 at 15:50
@RickJames there is only one update; please check the updated question. Thanks.
– AMB
Dec 30 '15 at 3:08