How to update 10 million+ rows in a single MySQL table as fast as possible?



























I am using MySQL 5.6 with the InnoDB storage engine for most of the tables. The InnoDB buffer pool size is 15 GB and the InnoDB data + indexes are around 10 GB. The server has 32 GB of RAM and runs CentOS 7 x64.



I have one big table which contains around 10 million+ records.



I get an updated dump file from a remote server every 24 hours. The file is in CSV format and I don't have control over that format. The file is ~750 MB. I tried inserting the data into a MyISAM table row by row and it took 35 minutes.



From each line of the file I need to take only 3 of the 10-12 values and update them in the database.



What's the best way to achieve something like this?



I need to do this daily.



Currently the flow is like this (sketched in SQL below the list):




  1. mysqli_begin_transaction

  2. Read the dump file line by line

  3. Update each record, line by line

  4. mysqli_commit
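
For reference, a minimal sketch of what that loop issues, assuming the column names from the schema in Update 3 (the real loop runs from PHP via mysqli with values bound from each CSV line):

START TRANSACTION;

-- repeated once per CSV line, with values parsed from that line:
UPDATE content
SET uploaders = 123, downloaders = 456, verified = 1
WHERE hash = 'abc123...';

-- ... millions of similar single-row UPDATEs ...

COMMIT;

Because everything sits in one transaction, the row locks taken by these updates are only released at the final commit.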


The above operation takes around 30-40 minutes to complete, and while it is running there are other updates going on, which gives me




Lock wait timeout exceeded; try restarting transaction




Update 1



I loaded the data into a new table using LOAD DATA LOCAL INFILE. In MyISAM it took 38.93 sec, while in InnoDB it took 7 min 5.21 sec. Then I did:



UPDATE table1 t1, table2 t2
SET
t1.field1 = t2.field1,
t1.field2 = t2.field2,
t1.field3 = t2.field3
WHERE t1.field10 = t2.field10

Query OK, 434914 rows affected (22 hours 14 min 47.55 sec)


Update 2



The same update with a join query:



UPDATE table1 a JOIN table2 b 
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4

(14 hours 56 min 46.85 sec)




Clarifications from questions in comments:




  • About 6% of the rows in the table will be updated by the file, but sometimes it can be as much as 25%.

  • There are indexes on the fields being updated. There are 12 indexes on the table, and 8 indexes include the update fields.

  • It is not necessary to do the update in one transaction. It can take time, but not more than 24 hours. I am looking to get it done in 1 hour without locking the whole table, as afterwards I have to update the Sphinx index which depends on this table. It does not matter if the steps take longer, as long as the database is available for other tasks.

  • I could modify the CSV format in a preprocessing step. The only thing that matters is a quick update without locking.

  • Table 2 is MyISAM. It is the table newly created from the CSV file using LOAD DATA INFILE. The MYI file size is 452 MB. Table 2 is indexed on the field1 column.

  • The MYD file of the MyISAM table is 663 MB.




Update 3:



Here are more details about both tables.



CREATE TABLE `content` (
`hash` char(40) CHARACTER SET ascii NOT NULL DEFAULT '',
`title` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`og_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`keywords` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`files_count` smallint(5) unsigned NOT NULL DEFAULT '0',
`more_files` smallint(5) unsigned NOT NULL DEFAULT '0',
`files` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '0',
`category` smallint(3) unsigned NOT NULL DEFAULT '600',
`size` bigint(19) unsigned NOT NULL DEFAULT '0',
`downloaders` int(11) NOT NULL DEFAULT '0',
`completed` int(11) NOT NULL DEFAULT '0',
`uploaders` int(11) NOT NULL DEFAULT '0',
`creation_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`upload_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`last_updated` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`vote_up` int(11) unsigned NOT NULL DEFAULT '0',
`vote_down` int(11) unsigned NOT NULL DEFAULT '0',
`comments_count` int(11) NOT NULL DEFAULT '0',
`imdb` int(8) unsigned NOT NULL DEFAULT '0',
`video_sample` tinyint(1) NOT NULL DEFAULT '0',
`video_quality` tinyint(2) NOT NULL DEFAULT '0',
`audio_lang` varchar(127) CHARACTER SET ascii NOT NULL DEFAULT '',
`subtitle_lang` varchar(127) CHARACTER SET ascii NOT NULL DEFAULT '',
`verified` tinyint(1) unsigned NOT NULL DEFAULT '0',
`uploader` int(11) unsigned NOT NULL DEFAULT '0',
`anonymous` tinyint(1) NOT NULL DEFAULT '0',
`enabled` tinyint(1) unsigned NOT NULL DEFAULT '0',
`tfile_size` int(11) unsigned NOT NULL DEFAULT '0',
`scrape_source` tinyint(1) unsigned NOT NULL DEFAULT '0',
`record_num` int(11) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`record_num`),
UNIQUE KEY `hash` (`hash`),
KEY `uploaders` (`uploaders`),
KEY `tfile_size` (`tfile_size`),
KEY `enabled_category_upload_date_verified_` (`enabled`,`category`,`upload_date`,`verified`),
KEY `enabled_upload_date_verified_` (`enabled`,`upload_date`,`verified`),
KEY `enabled_category_verified_` (`enabled`,`category`,`verified`),
KEY `enabled_verified_` (`enabled`,`verified`),
KEY `enabled_uploader_` (`enabled`,`uploader`),
KEY `anonymous_uploader_` (`anonymous`,`uploader`),
KEY `enabled_uploaders_upload_date_` (`enabled`,`uploaders`,`upload_date`),
KEY `enabled_verified_category` (`enabled`,`verified`,`category`),
KEY `verified_enabled_category` (`verified`,`enabled`,`category`)
) ENGINE=InnoDB AUTO_INCREMENT=7551163 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci ROW_FORMAT=FIXED


CREATE TABLE `content_csv_dump_temp` (
`hash` char(40) CHARACTER SET ascii NOT NULL DEFAULT '',
`title` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`category_id` int(11) unsigned NOT NULL DEFAULT '0',
`uploaders` int(11) unsigned NOT NULL DEFAULT '0',
`downloaders` int(11) unsigned NOT NULL DEFAULT '0',
`verified` tinyint(1) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci


And here is the update query which updates the content table using data from content_csv_dump_temp:



UPDATE content a JOIN content_csv_dump_temp b 
ON a.hash = b.hash
SET
a.uploaders = b.uploaders,
a.downloaders = b.downloaders,
a.verified = b.verified




Update 4:



All of the above testing was done on a test machine, but when I ran the same tests on the production machine, the queries were very fast.



mysql> UPDATE content_test a JOIN content_csv_dump_temp b
-> ON a.hash = b.hash
-> SET
-> a.uploaders = b.uploaders,
-> a.downloaders = b.downloaders,
-> a.verified = b.verified;
Query OK, 2673528 rows affected (7 min 50.42 sec)
Rows matched: 7044818 Changed: 2673528 Warnings: 0


I apologize for my mistake. It's better to use a join instead of updating each record individually. Now I am trying to improve it further using the index suggested by rick_james, and will update once benchmarking is done.



































  • Do you have a composite INDEX(field2, field3, field4) (in any order)? Please show us SHOW CREATE TABLE.

    – Rick James
    Dec 29 '15 at 6:42






  • 12 and 8 indexes is a serious part of your problem. MyISAM is another serious part. InnoDB or TokuDB perform much better with multiple indexes.

    – Rick James
    Dec 29 '15 at 6:43











  • You have two different UPDATEs. Please tell us exactly what the straightforward statement looks like for updating the table from the csv data. Then we may be able to help you devise a technique that meets your requirements.

    – Rick James
    Dec 29 '15 at 15:50











  • @RickJames there is only one update, please check the updated question. Thanks

    – AMB
    Dec 30 '15 at 3:08
















Tags: mysql, performance, mysql-5.6, update, myisam






6 Answers














Based on my experience, I would use LOAD DATA INFILE to import your CSV File.




The LOAD DATA INFILE statement reads rows from a text file into a
table at a very high speed.




Here is an example I found on the internet: Load Data example. I tested this example on my box and it worked fine.



Example Table



CREATE TABLE example (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`Column2` varchar(14) NOT NULL,
`Column3` varchar(14) NOT NULL,
`Column4` varchar(14) NOT NULL,
`Column5` DATE NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB


Example CSV File



# more /tmp/example.csv
Column1,Column2,Column3,Column4,Column5
1,A,Foo,sdsdsd,4/13/2013
2,B,Bar,sdsa,4/12/2013
3,C,Foo,wewqe,3/12/2013
4,D,Bar,asdsad,2/1/2013
5,E,FOObar,wewqe,5/1/2013


Import Statement to be run from MySQL console



LOAD DATA LOCAL INFILE '/tmp/example.csv'
-> INTO TABLE example
-> FIELDS TERMINATED BY ','
-> LINES TERMINATED BY '\n'
-> IGNORE 1 LINES
-> (id, Column3,Column4, @Column5)
-> set
-> Column5 = str_to_date(@Column5, '%m/%d/%Y');


Result



MySQL [testcsv]> select * from example;
+----+---------+---------+---------+------------+
| Id | Column2 | Column3 | Column4 | Column5 |
+----+---------+---------+---------+------------+
| 1 | | Column2 | Column3 | 0000-00-00 |
| 2 | | B | Bar | 0000-00-00 |
| 3 | | C | Foo | 0000-00-00 |
| 4 | | D | Bar | 0000-00-00 |
| 5 | | E | FOObar | 0000-00-00 |
+----+---------+---------+---------+------------+


IGNORE 1 LINES simply skips the first line, which contains the column headers.



After IGNORE, we specify which columns to import (skipping Column2), which matches one of the criteria in your question.



Here is another example directly from Oracle : LOAD DATA INFILE example



This should be enough to get you started.
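
Applied to the table in the question, the load step might look roughly like this. This is only a sketch: the file path, delimiters, and column order are assumptions, since the actual CSV layout isn't shown.

LOAD DATA LOCAL INFILE '/tmp/daily_dump.csv'    -- hypothetical path
INTO TABLE content_csv_dump_temp
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES                                  -- only if the file has a header row
(hash, title, category_id, uploaders, downloaders, verified, @dummy, @dummy, @dummy, @dummy);
-- Repeated @dummy slots discard the CSV columns that are not needed; how many there
-- are and where they sit depends on the real file layout.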
































  • I could use LOAD DATA for loading the data into a temp table and then use other queries to update it in the main table. Thanks

    – AMB
    Oct 30 '15 at 13:11




























In light of all the things mentioned, it looks like the bottleneck is the join itself.



ASPECT #1 : Join Buffer Size



In all likelihood, your join_buffer_size is too low.



According to the MySQL Documentation on How MySQL Uses the Join Buffer Cache




We only store the used columns in the join buffer, not the whole rows.




This being the case, make the keys of the join buffer stay in RAM.



You have 10 million rows times 4 bytes for each key. That's about 40M.



Try bumping it up in the session to 42M (a little bigger than 40M)



SET join_buffer_size = 1024 * 1024 * 42;
UPDATE table1 a JOIN table2 b
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4;


If this does the trick, proceed to add this to my.cnf



[mysqld]
join_buffer_size = 42M


Restarting mysqld is not required; new connections will pick up the change. Just run



mysql> SET GLOBAL join_buffer_size = 1024 * 1024 * 42;


ASPECT #2 : Join Operation



You could manipulate the style of the join operation by tweaking the optimizer.



According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins




When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.



For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.




This same page recommends doing this:



mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';


ASPECT #3 : Writing Updates to Disk (OPTIONAL)



Most forget to increase the innodb_write_io_threads to write dirty pages out of the buffer pool faster.



[mysqld]
innodb_write_io_threads = 16


You will have to restart MySQL for this change



GIVE IT A TRY !!!






























  • Nice! +1 for the tunable join buffer tip. If you gotta join, join in memory. Good tip!

    – Peter Dixon-Moses
    Dec 29 '15 at 20:26





































  1. CREATE TABLE that matches the CSV


  2. LOAD DATA into that table

  3. UPDATE real_table JOIN csv_table ON ... SET ..., ..., ...;

  4. DROP TABLE csv_table;


Step 3 will be a lot faster than row-by-row, but it will still lock all the rows in the table for a non-trivial amount of time. If this lock time matters more than how long the entire process takes, then...



If nothing else is writing to the table, then...





  1. CREATE TABLE that matches the CSV; no indexes except what is needed in the JOIN in the UPDATE. If unique, make it PRIMARY KEY.


  2. LOAD DATA into that table

  3. copy the real_table to new_table (CREATE ... SELECT)

  4. UPDATE new_table JOIN csv_table ON ... SET ..., ..., ...;

  5. RENAME TABLE real_table TO old, new_table TO real_table;

  6. DROP TABLE csv_table, old;


Step 3 is faster than the update, especially if unnecessary indexes are left off.

Step 5 is "instantaneous".
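
A sketch of that second recipe against the tables from the question. The staging layout assumes the CSV has been preprocessed down to the four needed columns (the question says the format can be changed in a preprocessing step), and the copy step uses CREATE TABLE ... LIKE plus INSERT ... SELECT instead of a bare CREATE ... SELECT so that the swapped-in copy keeps content's indexes:

-- 1-2: stage the CSV with only the index needed for the JOIN
CREATE TABLE csv_table (
  hash        char(40) CHARACTER SET ascii NOT NULL,
  uploaders   int unsigned NOT NULL,
  downloaders int unsigned NOT NULL,
  verified    tinyint unsigned NOT NULL,
  PRIMARY KEY (hash)
);
LOAD DATA LOCAL INFILE '/tmp/daily_dump_4cols.csv'   -- hypothetical preprocessed file
INTO TABLE csv_table
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(hash, uploaders, downloaders, verified);

-- 3: copy the live table
CREATE TABLE new_table LIKE content;
INSERT INTO new_table SELECT * FROM content;

-- 4: apply the day's changes to the copy
UPDATE new_table a
JOIN csv_table b ON a.hash = b.hash
SET a.uploaders = b.uploaders, a.downloaders = b.downloaders, a.verified = b.verified;

-- 5: atomic swap, 6: clean up
RENAME TABLE content TO content_old, new_table TO content;
DROP TABLE csv_table, content_old;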






























  • Let's say, for example, that while we are doing step 4 (after step 3), new data gets inserted into the real_table; we will then miss that data in the new_table. What is the workaround for this? Thanks

    – AMB
    Dec 25 '15 at 3:14











  • See what pt-online-schema-change does; it takes care of such issues via a TRIGGER.

    – Rick James
    Dec 25 '15 at 8:39











  • You probably do not need any indexes on the table from LOAD DATA. Adding unnecessary indexes is costly (in time).

    – Rick James
    Dec 29 '15 at 6:40











  • Based on the latest info, I am leaning toward the CSV file being loaded into a MyISAM table with just an AUTO_INCREMENT, then chunking 1K rows at a time based on the PK. But I need to see all the requirements and the table schema before trying to spell out the details.

    – Rick James
    Dec 29 '15 at 15:53











  • I have set hash as the PRIMARY KEY, but chunking in 50k batches using an ORDER BY query takes more time. Would it be better if I create an auto-increment column and set it as the PRIMARY KEY?

    – AMB
    Dec 30 '15 at 3:11



































You've said:




  • Updates affect 6-25% of your table

  • You want to do this as fast as possible (<1hr)

  • without locking

  • it doesn't have to be in a single transaction

  • yet (in a comment on Rick James' answer) you express concern about race conditions


Many of these statements can be contradictory. For example, large updates w/o locking the table. Or avoiding race conditions w/o using one giant transaction.



Also, since your table is heavily indexed, both inserts and updates can be slow.





Avoiding race conditions



If you're able to add an updated time stamp to your table, you can solve for race conditions while also avoiding logging a half-million updates in a single transaction.



This frees you to perform line-by-line updates (as you currently do), but with autocommit or more reasonable transaction batches.



You avoid race conditions (while updating line-by-line) by performing a check that a later update has not already occurred (UPDATE ... WHERE pk = [pk] AND updated < [batchfile date])



And, importantly, this lets you run parallel updates.
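
A sketch of that guarded statement against the question's content table, assuming a hypothetical updated TIMESTAMP column has been added:

-- hypothetical: content has gained an `updated` TIMESTAMP column
-- one statement per CSV line (or per small batch), run with autocommit:
UPDATE content
SET uploaders = 123, downloaders = 456, verified = 1,
    updated = '2015-12-29 00:00:00'          -- the batch file's date
WHERE hash = 'abc123...'
  AND updated < '2015-12-29 00:00:00';       -- skip rows a newer batch already touched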





Running as fast as possible —Parallelizing



With this time stamp check now in place:




  1. Split your batch file into some reasonable sized chunks (say 50,000 rows/file)

  2. In parallel, have a script read in each file and output a file with 50,000 UPDATE statements.

  3. In parallel, once (2) finishes, have mysql run each sql file.


(e.g. in bash, look at split and xargs -P for ways to easily run a command in parallel. The degree of parallelism depends upon how many threads you're willing to dedicate to the update.)






























  • Keep in mind that "line-by-line" is likely to be 10x slower than doing things in batches of at least 100.

    – Rick James
    Dec 29 '15 at 6:37











  • You'd have to benchmark it in this case to be sure. Updating 6-25% of a table (with 8 indexes involved with the updated columns), I'd entertain the possibility that index maintenance becomes the bottleneck.

    – Peter Dixon-Moses
    Dec 29 '15 at 11:37











  • I mean, in some cases it might be faster to drop indices, bulk update, and recreate them after... but OP doesn't want downtime.

    – Peter Dixon-Moses
    Dec 29 '15 at 11:58

































For the UPDATE to run fast, you need



INDEX(uploaders, downloaders, verified)


It can be on either table. The three fields can be in any order.



That will facilitate the UPDATE being able to quickly match up the rows between the two tables.



And make the datatypes the same in the two tables (both INT SIGNED or both INT UNSIGNED).
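
For example, adding the suggested index on the staging table might look like this (a sketch; the index name is arbitrary, and either table would do per the above):

ALTER TABLE content_csv_dump_temp
  ADD INDEX upl_dl_ver (uploaders, downloaders, verified);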






























  • this actually slowed down the update.

    – AMB
    Jan 1 '16 at 3:34











  • Hmmm... Please provide EXPLAIN UPDATE ...;.

    – Rick James
    Jan 1 '16 at 4:59

































Large updates are I/O bound. I would suggest:




  1. Create a separate table that will store your 3 frequently updated fields. Let's call one table assets_static, where you keep, well, static data, and the other assets_dynamic, which will store uploaders, downloaders and verified.

  2. If you can, use the MEMORY engine for the assets_dynamic table (back it up to disk after each update).

  3. Update your lightweight and nimble assets_dynamic as per your Update 4 (i.e. LOAD DATA INFILE ... INTO temp; then UPDATE assets_dynamic a JOIN temp b ON a.id = b.id SET [what has to be updated]); see the sketch after this list. This should take less than a minute. (On our system, assets_dynamic has 95M rows and updates impact ~6M rows, in a little more than 40s.)

  4. When you run Sphinx's indexer, JOIN assets_static and assets_dynamic (assuming that you want to use one of these fields as an attribute).
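
A minimal sketch of that layout, using the column names from the question. The answer joins on id; hash is used here as the shared key because it is the question's natural key, so treat that, and the exact DDL, as assumptions:

-- the three hot columns move into a small table; everything else stays in assets_static
CREATE TABLE assets_dynamic (
  hash        char(40) CHARACTER SET ascii NOT NULL,
  uploaders   int unsigned NOT NULL DEFAULT 0,
  downloaders int unsigned NOT NULL DEFAULT 0,
  verified    tinyint unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (hash)
) ENGINE=MEMORY;   -- needs max_heap_table_size large enough; persist to disk after each batch

-- daily refresh: only the small, hot table is join-updated
UPDATE assets_dynamic a
JOIN temp b ON a.hash = b.hash
SET a.uploaders = b.uploaders, a.downloaders = b.downloaders, a.verified = b.verified;

-- feed Sphinx by joining the two halves back together
SELECT s.*, d.uploaders, d.downloaders, d.verified
FROM assets_static s
JOIN assets_dynamic d ON d.hash = s.hash;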





share








New contributor




user3127882 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.




















    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "182"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f119621%2fhow-to-update-10-million-rows-in-mysql-single-table-as-fast-as-possible%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    6 Answers
    6






    active

    oldest

    votes








    6 Answers
    6






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    13














    Based on my experience, I would use LOAD DATA INFILE to import your CSV File.




    The LOAD DATA INFILE statement reads rows from a text file into a
    table at a very high speed.




    Example I found on the internet Load Data example. I tested this example on my box and worked fine



    Example Table



    CREATE TABLE example (
    `Id` int(11) NOT NULL AUTO_INCREMENT,
    `Column2` varchar(14) NOT NULL,
    `Column3` varchar(14) NOT NULL,
    `Column4` varchar(14) NOT NULL,
    `Column5` DATE NOT NULL,
    PRIMARY KEY (`Id`)
    ) ENGINE=InnoDB


    Example CSV File



    # more /tmp/example.csv
    Column1,Column2,Column3,Column4,Column5
    1,A,Foo,sdsdsd,4/13/2013
    2,B,Bar,sdsa,4/12/2013
    3,C,Foo,wewqe,3/12/2013
    4,D,Bar,asdsad,2/1/2013
    5,E,FOObar,wewqe,5/1/2013


    Import Statement to be run from MySQL console



    LOAD DATA LOCAL INFILE '/tmp/example.csv'
    -> INTO TABLE example
    -> FIELDS TERMINATED BY ','
    -> LINES TERMINATED BY 'n'
    -> IGNORE 1 LINES
    -> (id, Column3,Column4, @Column5)
    -> set
    -> Column5 = str_to_date(@Column5, '%m/%d/%Y');


    Result



    MySQL [testcsv]> select * from example;
    +----+---------+---------+---------+------------+
    | Id | Column2 | Column3 | Column4 | Column5 |
    +----+---------+---------+---------+------------+
    | 1 | | Column2 | Column3 | 0000-00-00 |
    | 2 | | B | Bar | 0000-00-00 |
    | 3 | | C | Foo | 0000-00-00 |
    | 4 | | D | Bar | 0000-00-00 |
    | 5 | | E | FOObar | 0000-00-00 |
    +----+---------+---------+---------+------------+


    IGNORE just simply ignores the first line which are column headers.



    After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.



    Here is another example directly from Oracle : LOAD DATA INFILE example



    This should be enough to get you started.






    share|improve this answer


























    • i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

      – AMB
      Oct 30 '15 at 13:11
















    13














    Based on my experience, I would use LOAD DATA INFILE to import your CSV File.




    The LOAD DATA INFILE statement reads rows from a text file into a
    table at a very high speed.




    Example I found on the internet Load Data example. I tested this example on my box and worked fine



    Example Table



    CREATE TABLE example (
    `Id` int(11) NOT NULL AUTO_INCREMENT,
    `Column2` varchar(14) NOT NULL,
    `Column3` varchar(14) NOT NULL,
    `Column4` varchar(14) NOT NULL,
    `Column5` DATE NOT NULL,
    PRIMARY KEY (`Id`)
    ) ENGINE=InnoDB


    Example CSV File



    # more /tmp/example.csv
    Column1,Column2,Column3,Column4,Column5
    1,A,Foo,sdsdsd,4/13/2013
    2,B,Bar,sdsa,4/12/2013
    3,C,Foo,wewqe,3/12/2013
    4,D,Bar,asdsad,2/1/2013
    5,E,FOObar,wewqe,5/1/2013


    Import Statement to be run from MySQL console



    LOAD DATA LOCAL INFILE '/tmp/example.csv'
    -> INTO TABLE example
    -> FIELDS TERMINATED BY ','
    -> LINES TERMINATED BY 'n'
    -> IGNORE 1 LINES
    -> (id, Column3,Column4, @Column5)
    -> set
    -> Column5 = str_to_date(@Column5, '%m/%d/%Y');


    Result



    MySQL [testcsv]> select * from example;
    +----+---------+---------+---------+------------+
    | Id | Column2 | Column3 | Column4 | Column5 |
    +----+---------+---------+---------+------------+
    | 1 | | Column2 | Column3 | 0000-00-00 |
    | 2 | | B | Bar | 0000-00-00 |
    | 3 | | C | Foo | 0000-00-00 |
    | 4 | | D | Bar | 0000-00-00 |
    | 5 | | E | FOObar | 0000-00-00 |
    +----+---------+---------+---------+------------+


    IGNORE just simply ignores the first line which are column headers.



    After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.



    Here is another example directly from Oracle : LOAD DATA INFILE example



    This should be enough to get you started.






    share|improve this answer


























    • i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

      – AMB
      Oct 30 '15 at 13:11














    13












    13








    13







    Based on my experience, I would use LOAD DATA INFILE to import your CSV File.




    The LOAD DATA INFILE statement reads rows from a text file into a
    table at a very high speed.




    Example I found on the internet Load Data example. I tested this example on my box and worked fine



    Example Table



    CREATE TABLE example (
    `Id` int(11) NOT NULL AUTO_INCREMENT,
    `Column2` varchar(14) NOT NULL,
    `Column3` varchar(14) NOT NULL,
    `Column4` varchar(14) NOT NULL,
    `Column5` DATE NOT NULL,
    PRIMARY KEY (`Id`)
    ) ENGINE=InnoDB


    Example CSV File



    # more /tmp/example.csv
    Column1,Column2,Column3,Column4,Column5
    1,A,Foo,sdsdsd,4/13/2013
    2,B,Bar,sdsa,4/12/2013
    3,C,Foo,wewqe,3/12/2013
    4,D,Bar,asdsad,2/1/2013
    5,E,FOObar,wewqe,5/1/2013


    Import Statement to be run from MySQL console



    LOAD DATA LOCAL INFILE '/tmp/example.csv'
    -> INTO TABLE example
    -> FIELDS TERMINATED BY ','
    -> LINES TERMINATED BY 'n'
    -> IGNORE 1 LINES
    -> (id, Column3,Column4, @Column5)
    -> set
    -> Column5 = str_to_date(@Column5, '%m/%d/%Y');


    Result



    MySQL [testcsv]> select * from example;
    +----+---------+---------+---------+------------+
    | Id | Column2 | Column3 | Column4 | Column5 |
    +----+---------+---------+---------+------------+
    | 1 | | Column2 | Column3 | 0000-00-00 |
    | 2 | | B | Bar | 0000-00-00 |
    | 3 | | C | Foo | 0000-00-00 |
    | 4 | | D | Bar | 0000-00-00 |
    | 5 | | E | FOObar | 0000-00-00 |
    +----+---------+---------+---------+------------+


    IGNORE just simply ignores the first line which are column headers.



    After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.



    Here is another example directly from Oracle : LOAD DATA INFILE example



    This should be enough to get you started.






    share|improve this answer















    Based on my experience, I would use LOAD DATA INFILE to import your CSV File.




    The LOAD DATA INFILE statement reads rows from a text file into a
    table at a very high speed.




    Example I found on the internet Load Data example. I tested this example on my box and worked fine



    Example Table



    CREATE TABLE example (
    `Id` int(11) NOT NULL AUTO_INCREMENT,
    `Column2` varchar(14) NOT NULL,
    `Column3` varchar(14) NOT NULL,
    `Column4` varchar(14) NOT NULL,
    `Column5` DATE NOT NULL,
    PRIMARY KEY (`Id`)
    ) ENGINE=InnoDB


    Example CSV File



    # more /tmp/example.csv
    Column1,Column2,Column3,Column4,Column5
    1,A,Foo,sdsdsd,4/13/2013
    2,B,Bar,sdsa,4/12/2013
    3,C,Foo,wewqe,3/12/2013
    4,D,Bar,asdsad,2/1/2013
    5,E,FOObar,wewqe,5/1/2013


    Import Statement to be run from MySQL console



    LOAD DATA LOCAL INFILE '/tmp/example.csv'
    -> INTO TABLE example
    -> FIELDS TERMINATED BY ','
    -> LINES TERMINATED BY 'n'
    -> IGNORE 1 LINES
    -> (id, Column3,Column4, @Column5)
    -> set
    -> Column5 = str_to_date(@Column5, '%m/%d/%Y');


    Result



    MySQL [testcsv]> select * from example;
    +----+---------+---------+---------+------------+
    | Id | Column2 | Column3 | Column4 | Column5 |
    +----+---------+---------+---------+------------+
    | 1 | | Column2 | Column3 | 0000-00-00 |
    | 2 | | B | Bar | 0000-00-00 |
    | 3 | | C | Foo | 0000-00-00 |
    | 4 | | D | Bar | 0000-00-00 |
    | 5 | | E | FOObar | 0000-00-00 |
    +----+---------+---------+---------+------------+


    IGNORE just simply ignores the first line which are column headers.



    After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.



    Here is another example directly from Oracle : LOAD DATA INFILE example



    This should be enough to get you started.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Oct 30 '15 at 12:59

























    answered Oct 30 '15 at 12:45









    Craig EfreinCraig Efrein

    7,19784078




    7,19784078













    • i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

      – AMB
      Oct 30 '15 at 13:11



















    • i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

      – AMB
      Oct 30 '15 at 13:11

















    i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

    – AMB
    Oct 30 '15 at 13:11





    i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

    – AMB
    Oct 30 '15 at 13:11













    12





    +100









    In light of all the things mentioned, it looks like the bottleneck is the join itself.



    ASPECT #1 : Join Buffer Size



    In all likelihood, your join_buffer_size is probably too low.



    According to the MySQL Documentation on How MySQL Uses the Join Buffer Cache




    We only store the used columns in the join buffer, not the whole rows.




    This being the case, make the keys of the join buffer stay in RAM.



    You have 10 million rows times 4 bytes for each key. That's about 40M.



    Try bumping it up in the session to 42M (a little bigger than 40M)



    SET join_buffer_size = 1024 * 1024 * 42;
    UPDATE table1 a JOIN table2 b
    ON a.field1 = b.field1
    SET
    a.field2 = b.field2,
    a.field3 = b.field3,
    a.field4 = b.field4;


    If this does the trick, proceed to add this to my.cnf



    [mysqld]
    join_buffer_size = 42M


    Restarting mysqld not required for new connections. Just run



    mysql> SET GLOBAL join_buffer_size = 1024 * 1024 * 42;


    ASPECT #2 : Join Operation



    You could manipulate the style of the join operation by tweeking the optimizer



    According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins




    When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.



    For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.




    This same page recommends doing this:



    mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';


    ASPECT #3 : Writing Updates to Disk (OPTIONAL)



    Most forget to increase the innodb_write_io_threads to write dirty pages out of the buffer pool faster.



    [mysqld]
    innodb_write_io_threads = 16


    You will have to restart MySQL for this change



    GIVE IT A TRY !!!






    answered Dec 28 '15 at 21:56 by RolandoMySQLDBA

    • Nice! +1 for the tunable join buffer tip. If you gotta join, join in memory. Good tip!
      – Peter Dixon-Moses
      Dec 29 '15 at 20:26

    3
















    1. CREATE TABLE that matches the CSV


    2. LOAD DATA into that table

    3. UPDATE real_table JOIN csv_table ON ... SET ..., ..., ...;

    4. DROP TABLE csv_table;


    Step 3 will be a lot faster than row-by-row, but it will still lock all the rows in the table for a non-trivial amount of time. If that lock time matters more than how long the entire process takes, then consider the alternative below.



    If nothing else is writing to the table, then...





    1. CREATE TABLE that matches the CSV; no indexes except what is needed in the JOIN in the UPDATE. If unique, make it PRIMARY KEY.


    2. LOAD DATA into that table

    3. copy the real_table to new_table (CREATE ... SELECT)

    4. UPDATE new_table JOIN csv_table ON ... SET ..., ..., ...;

    5. RENAME TABLE real_table TO old, new_table TO real_table;

    6. DROP TABLE csv_table, old;


    Step 3 is faster than the update, especially if unnecessary indexes are left off.

    Step 5 is "instantaneous".






    answered Dec 6 '15 at 7:49 by Rick James

    • lets say in seconds example, after step 3, we are doing step 4, then the new data gets inserted in the real_table so we will miss that data in the new_table ? what is workaround for this ? thanks
      – AMB
      Dec 25 '15 at 3:14

    • See what pt-online-schema-change does; it takes care of such issues via a TRIGGER.
      – Rick James
      Dec 25 '15 at 8:39

    • You probably do not need any indexes on the table from LOAD DATA. Adding unnecessary indexes is costly (in time).
      – Rick James
      Dec 29 '15 at 6:40

    • Based on the latest info, I am leaning toward the CSV file being loaded into a MyISAM table with just an AUTO_INCREMENT, then chunking 1K rows at a time based on the PK. But I need to see all the requirements and the table schema before trying to spell out the details.
      – Rick James
      Dec 29 '15 at 15:53

    • i have set hash as PRIMARY index, but while chunking in 50k using order query takes more time., would it be better if i create auto increment ? and set it as PRIMARY index ?
      – AMB
      Dec 30 '15 at 3:11













    2














    You've said:




    • Updates affect 6-25% of your table

    • You want to do this as fast as possible (<1hr)

    • without locking

    • it doesn't have to be in a single transaction

    • yet (in a comment on Rick James's answer), you express concern about race conditions


    Several of these requirements pull against each other: for example, large updates without locking the table, or avoiding race conditions without using one giant transaction.



    Also, since your table is heavily indexed, both inserts and updates can be slow.





    Avoiding race conditions



    If you're able to add an updated time stamp to your table, you can solve for race conditions while also avoiding logging a half-million updates in a single transaction.



    This frees you to perform line-by-line updates (as you currently do), but with autocommit or more reasonable transaction batches.



    You avoid race conditions (while updating line-by-line) by performing a check that a later update has not already occurred (UPDATE ... WHERE pk = [pk] AND updated < [batchfile date])
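    A hedged, concrete version of that guard (hash and last_updated exist in the content schema shown in the question; the literal values below stand in for one line of the batch file, and stamping last_updated with the batch date is an assumption about how that column is maintained):

    UPDATE content
       SET downloaders  = 123,                      -- values from this CSV line (placeholders)
           completed    = 456,
           uploaders    = 789,
           last_updated = '2015-12-28 00:00:00'     -- assumed: stamp with the batch-file date
     WHERE hash = '...'                             -- placeholder key value
       AND last_updated < '2015-12-28 00:00:00';    -- skip rows already touched since the dump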



    And, importantly, this lets you run parallel updates.





    Running as fast as possible: Parallelizing



    With this time stamp check now in place:




    1. Split your batch file into reasonably sized chunks (say 50,000 rows per file)

    2. In parallel, have a script read in each file and output a file with 50,000 UPDATE statements.

    3. In parallel, once (2) finishes, have mysql run each sql file.


    (e.g. in bash, look at split and xargs -P for easy ways to run a command across many inputs in parallel. The degree of parallelism depends on how many threads you're willing to dedicate to the update.)
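    For instance, each generated chunk could be a plain SQL file of the form below, which the mysql client then executes (the file name, chunk size and batch date are illustrative only; the guarded UPDATE is the same pattern as above):

    -- chunk_0001.sql (hypothetical name)
    START TRANSACTION;
    UPDATE content SET downloaders = 10, completed = 5, uploaders = 2,
           last_updated = '2015-12-28 00:00:00'
     WHERE hash = '...' AND last_updated < '2015-12-28 00:00:00';
    -- ... one such UPDATE per CSV line in this chunk ...
    COMMIT;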






    answered Dec 28 '15 at 21:07 by Peter Dixon-Moses

    • Keep in mind that "line-by-line" is likely to be 10x slower than doing things in batches of at least 100.
      – Rick James
      Dec 29 '15 at 6:37

    • You'd have to benchmark it in this case to be sure. Updating 6-25% of a table (with 8 indexes involved with the updated columns), I'd entertain the possibility that index maintenance becomes the bottleneck.
      – Peter Dixon-Moses
      Dec 29 '15 at 11:37

    • I mean, in some cases it might be faster to drop indices, bulk update, and recreate them after... but OP doesn't want downtime.
      – Peter Dixon-Moses
      Dec 29 '15 at 11:58











    0














    For the UPDATE to run fast, you need



    INDEX(uploaders, downloaders, verified)


    It can be on either table. The three fields can be in any order.



    That will let the UPDATE quickly match up the rows between the two tables.



    And make the datatypes the same in the two tables (both INT SIGNED or both INT UNSIGNED).
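    A hedged sketch of both suggestions, assuming the table loaded from the CSV is called csv_table (verified's exact type is not shown in the question, so match it to whatever content uses):

    ALTER TABLE csv_table
        ADD INDEX idx_upl_dl_ver (uploaders, downloaders, verified);

    -- keep the compared columns identically typed on both tables, e.g.
    ALTER TABLE csv_table
        MODIFY uploaders   int NOT NULL,
        MODIFY downloaders int NOT NULL;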






    answered Dec 30 '15 at 16:03 by Rick James

    • this actually slowed down the update.
      – AMB
      Jan 1 '16 at 3:34

    • Hmmm... Please provide EXPLAIN UPDATE ...;.
      – Rick James
      Jan 1 '16 at 4:59











    0














    Large updates are I/O bound. I would suggest:




    1. Create a distinct table that will store your 3 frequently updated fields. Let's call one table assets_static where you keep, well, static data, and the other assets_dynamic that will store uploaders, downloaders and verified.

    2. If you can, use the MEMORY engine for the assets_dynamic table (back it up to disk after each update).

    3. Update your lightweight and nimble assets_dynamic as per your update 4 (i.e. LOAD INFILE ... INTO temp; UPDATE assets_dynamic a JOIN temp b ON a.id = b.id SET [what has to be updated]). This should take less than a minute. (On our system, assets_dynamic has 95M rows and updates impact ~6M rows, in a little more than 40s.) See the sketch after this list.

    4. When you run Sphinx's indexer, JOIN assets_static and assets_dynamic (assuming that you want to use one of these fields as an attribute).
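    A hedged sketch of the split and the daily refresh (column names and types are assumptions, and the id join key follows this answer; the question's table keys on hash, so adapt as needed):

    CREATE TABLE assets_dynamic (
        id          int unsigned NOT NULL PRIMARY KEY,
        uploaders   int NOT NULL DEFAULT 0,
        downloaders int NOT NULL DEFAULT 0,
        verified    tinyint NOT NULL DEFAULT 0
    ) ENGINE = MEMORY;           -- per step 2; back it up to disk after each refresh

    -- step 3: load the CSV into a scratch table, then join-update the hot table
    CREATE TABLE temp LIKE assets_dynamic;
    ALTER TABLE temp ENGINE = InnoDB;             -- keep the scratch copy on disk

    LOAD DATA LOCAL INFILE '/path/to/dump.csv'    -- placeholder path/format
        INTO TABLE temp
        FIELDS TERMINATED BY ',';

    UPDATE assets_dynamic a JOIN temp b ON a.id = b.id
       SET a.uploaders   = b.uploaders,
           a.downloaders = b.downloaders,
           a.verified    = b.verified;

    DROP TABLE temp;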





    answered 5 mins ago by user3127882 (new contributor)





























