How to update 10 million+ rows in a single MySQL table as fast as possible?



























I am using MySQL 5.6 with the InnoDB storage engine for most of the tables. The InnoDB buffer pool size is 15 GB and the InnoDB data + indexes are around 10 GB. The server has 32 GB of RAM and runs CentOS 7 x64.



I have one big table which contains around 10 million+ records.



I get an updated dump file from a remote server every 24 hours. The file is in CSV format and I don't have control over that format. The file is ~750 MB. I tried inserting the data into a MyISAM table row by row and it took 35 minutes.



From each line of the file I need to take only 3 of the 10-12 values and update them in the database.



What's the best way to achieve something like this?



I need to do this daily.



Currently the flow is like this (sketched in SQL below the list):




  1. mysqli_begin_transaction

  2. Read the dump file line by line

  3. Update each record, line by line

  4. mysqli_commit
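
For reference, a minimal sketch of what that loop issues, assuming the column names from the schema in Update 3 (the real loop runs from PHP via mysqli with values bound from each CSV line):

START TRANSACTION;

-- repeated once per CSV line, with values parsed from that line:
UPDATE content
SET uploaders = 123, downloaders = 456, verified = 1
WHERE hash = 'abc123...';

-- ... millions of similar single-row UPDATEs ...

COMMIT;

Because everything sits in one transaction, the row locks taken by these updates are only released at the final commit.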


The above operation takes around 30-40 minutes to complete, and while it is running there are other updates going on, which gives me




Lock wait timeout exceeded; try restarting transaction




Update 1



I loaded the data into a new table using LOAD DATA LOCAL INFILE. In MyISAM it took 38.93 sec, while in InnoDB it took 7 min 5.21 sec. Then I did:



UPDATE table1 t1, table2 t2
SET
t1.field1 = t2.field1,
t1.field2 = t2.field2,
t1.field3 = t2.field3
WHERE t1.field10 = t2.field10

Query OK, 434914 rows affected (22 hours 14 min 47.55 sec)


Update 2



The same update with a join query:



UPDATE table1 a JOIN table2 b 
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4

(14 hours 56 min 46.85 sec)




Clarifications from questions in comments:




  • About 6% of the rows in the table will be updated by the file, but sometimes it can be as much as 25%.

  • There are indexes on the fields being updated. There are 12 indexes on the table, and 8 indexes include the update fields.

  • It is not necessary to do the update in one transaction. It can take time, but not more than 24 hours. I am looking to get it done in 1 hour without locking the whole table, as afterwards I have to update the Sphinx index which depends on this table. It does not matter if the steps take longer, as long as the database is available for other tasks.

  • I could modify the CSV format in a preprocessing step. The only thing that matters is a quick update without locking.

  • Table 2 is MyISAM. It is the table newly created from the CSV file using LOAD DATA INFILE. The MYI file size is 452 MB. Table 2 is indexed on the field1 column.

  • The MYD file of the MyISAM table is 663 MB.




Update 3:



Here are more details about both tables.



CREATE TABLE `content` (
`hash` char(40) CHARACTER SET ascii NOT NULL DEFAULT '',
`title` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`og_name` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`keywords` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '',
`files_count` smallint(5) unsigned NOT NULL DEFAULT '0',
`more_files` smallint(5) unsigned NOT NULL DEFAULT '0',
`files` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT '0',
`category` smallint(3) unsigned NOT NULL DEFAULT '600',
`size` bigint(19) unsigned NOT NULL DEFAULT '0',
`downloaders` int(11) NOT NULL DEFAULT '0',
`completed` int(11) NOT NULL DEFAULT '0',
`uploaders` int(11) NOT NULL DEFAULT '0',
`creation_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`upload_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`last_updated` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`vote_up` int(11) unsigned NOT NULL DEFAULT '0',
`vote_down` int(11) unsigned NOT NULL DEFAULT '0',
`comments_count` int(11) NOT NULL DEFAULT '0',
`imdb` int(8) unsigned NOT NULL DEFAULT '0',
`video_sample` tinyint(1) NOT NULL DEFAULT '0',
`video_quality` tinyint(2) NOT NULL DEFAULT '0',
`audio_lang` varchar(127) CHARACTER SET ascii NOT NULL DEFAULT '',
`subtitle_lang` varchar(127) CHARACTER SET ascii NOT NULL DEFAULT '',
`verified` tinyint(1) unsigned NOT NULL DEFAULT '0',
`uploader` int(11) unsigned NOT NULL DEFAULT '0',
`anonymous` tinyint(1) NOT NULL DEFAULT '0',
`enabled` tinyint(1) unsigned NOT NULL DEFAULT '0',
`tfile_size` int(11) unsigned NOT NULL DEFAULT '0',
`scrape_source` tinyint(1) unsigned NOT NULL DEFAULT '0',
`record_num` int(11) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`record_num`),
UNIQUE KEY `hash` (`hash`),
KEY `uploaders` (`uploaders`),
KEY `tfile_size` (`tfile_size`),
KEY `enabled_category_upload_date_verified_` (`enabled`,`category`,`upload_date`,`verified`),
KEY `enabled_upload_date_verified_` (`enabled`,`upload_date`,`verified`),
KEY `enabled_category_verified_` (`enabled`,`category`,`verified`),
KEY `enabled_verified_` (`enabled`,`verified`),
KEY `enabled_uploader_` (`enabled`,`uploader`),
KEY `anonymous_uploader_` (`anonymous`,`uploader`),
KEY `enabled_uploaders_upload_date_` (`enabled`,`uploaders`,`upload_date`),
KEY `enabled_verified_category` (`enabled`,`verified`,`category`),
KEY `verified_enabled_category` (`verified`,`enabled`,`category`)
) ENGINE=InnoDB AUTO_INCREMENT=7551163 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci ROW_FORMAT=FIXED


CREATE TABLE `content_csv_dump_temp` (
`hash` char(40) CHARACTER SET ascii NOT NULL DEFAULT '',
`title` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`category_id` int(11) unsigned NOT NULL DEFAULT '0',
`uploaders` int(11) unsigned NOT NULL DEFAULT '0',
`downloaders` int(11) unsigned NOT NULL DEFAULT '0',
`verified` tinyint(1) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci


And here is the update query which updates the content table using data from content_csv_dump_temp:



UPDATE content a JOIN content_csv_dump_temp b 
ON a.hash = b.hash
SET
a.uploaders = b.uploaders,
a.downloaders = b.downloaders,
a.verified = b.verified




Update 4:



All of the above testing was done on a test machine, but when I ran the same tests on the production machine, the queries were very fast.



mysql> UPDATE content_test a JOIN content_csv_dump_temp b
-> ON a.hash = b.hash
-> SET
-> a.uploaders = b.uploaders,
-> a.downloaders = b.downloaders,
-> a.verified = b.verified;
Query OK, 2673528 rows affected (7 min 50.42 sec)
Rows matched: 7044818 Changed: 2673528 Warnings: 0


I apologize for my mistake. It's better to use a join instead of updating each record individually. Now I am trying to improve it further using the index suggested by rick_james, and will update once benchmarking is done.



































  • Do you have a composite INDEX(field2, field3, field4) (in any order)? Please show us SHOW CREATE TABLE.

    – Rick James
    Dec 29 '15 at 6:42






  • 12 and 8 indexes is a serious part of your problem. MyISAM is another serious part. InnoDB or TokuDB perform much better with multiple indexes.

    – Rick James
    Dec 29 '15 at 6:43











  • You have two different UPDATEs. Please tell us exactly what the straightforward statement looks like for updating the table from the csv data. Then we may be able to help you devise a technique that meets your requirements.

    – Rick James
    Dec 29 '15 at 15:50











  • @RickJames there is only one update, please check the updated question. Thanks

    – AMB
    Dec 30 '15 at 3:08
















Tags: mysql, performance, mysql-5.6, update, myisam






6 Answers














Based on my experience, I would use LOAD DATA INFILE to import your CSV File.




The LOAD DATA INFILE statement reads rows from a text file into a
table at a very high speed.




Here is an example I found on the internet: Load Data example. I tested this example on my box and it worked fine.



Example Table



CREATE TABLE example (
`Id` int(11) NOT NULL AUTO_INCREMENT,
`Column2` varchar(14) NOT NULL,
`Column3` varchar(14) NOT NULL,
`Column4` varchar(14) NOT NULL,
`Column5` DATE NOT NULL,
PRIMARY KEY (`Id`)
) ENGINE=InnoDB


Example CSV File



# more /tmp/example.csv
Column1,Column2,Column3,Column4,Column5
1,A,Foo,sdsdsd,4/13/2013
2,B,Bar,sdsa,4/12/2013
3,C,Foo,wewqe,3/12/2013
4,D,Bar,asdsad,2/1/2013
5,E,FOObar,wewqe,5/1/2013


Import Statement to be run from MySQL console



LOAD DATA LOCAL INFILE '/tmp/example.csv'
-> INTO TABLE example
-> FIELDS TERMINATED BY ','
-> LINES TERMINATED BY '\n'
-> IGNORE 1 LINES
-> (id, Column3,Column4, @Column5)
-> set
-> Column5 = str_to_date(@Column5, '%m/%d/%Y');


Result



MySQL [testcsv]> select * from example;
+----+---------+---------+---------+------------+
| Id | Column2 | Column3 | Column4 | Column5 |
+----+---------+---------+---------+------------+
| 1 | | Column2 | Column3 | 0000-00-00 |
| 2 | | B | Bar | 0000-00-00 |
| 3 | | C | Foo | 0000-00-00 |
| 4 | | D | Bar | 0000-00-00 |
| 5 | | E | FOObar | 0000-00-00 |
+----+---------+---------+---------+------------+


IGNORE 1 LINES simply skips the first line, which contains the column headers.



After IGNORE, we specify which columns to import (skipping Column2), which matches one of the criteria in your question.



Here is another example directly from Oracle : LOAD DATA INFILE example



This should be enough to get you started.
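
Applied to the table in the question, the load step might look roughly like this. This is only a sketch: the file path, delimiters, and column order are assumptions, since the actual CSV layout isn't shown.

LOAD DATA LOCAL INFILE '/tmp/daily_dump.csv'    -- hypothetical path
INTO TABLE content_csv_dump_temp
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES                                  -- only if the file has a header row
(hash, title, category_id, uploaders, downloaders, verified, @dummy, @dummy, @dummy, @dummy);
-- Repeated @dummy slots discard the CSV columns that are not needed; how many there
-- are and where they sit depends on the real file layout.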
































  • I could use LOAD DATA for loading the data into a temp table and then use other queries to update it in the main table. Thanks

    – AMB
    Oct 30 '15 at 13:11




























In light of all the things mentioned, it looks like the bottleneck is the join itself.



ASPECT #1 : Join Buffer Size



In all likelihood, your join_buffer_size is too low.



According to the MySQL Documentation on How MySQL Uses the Join Buffer Cache




We only store the used columns in the join buffer, not the whole rows.




This being the case, make the keys of the join buffer stay in RAM.



You have 10 million rows times 4 bytes for each key. That's about 40M.



Try bumping it up in the session to 42M (a little bigger than 40M)



SET join_buffer_size = 1024 * 1024 * 42;
UPDATE table1 a JOIN table2 b
ON a.field1 = b.field1
SET
a.field2 = b.field2,
a.field3 = b.field3,
a.field4 = b.field4;


If this does the trick, proceed to add this to my.cnf



[mysqld]
join_buffer_size = 42M


Restarting mysqld is not required; new connections will pick up the change. Just run



mysql> SET GLOBAL join_buffer_size = 1024 * 1024 * 42;


ASPECT #2 : Join Operation



You could manipulate the style of the join operation by tweaking the optimizer.



According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins




When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.



For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.




This same page recommends doing this:



mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';


ASPECT #3 : Writing Updates to Disk (OPTIONAL)



Most forget to increase the innodb_write_io_threads to write dirty pages out of the buffer pool faster.



[mysqld]
innodb_write_io_threads = 16


You will have to restart MySQL for this change



GIVE IT A TRY !!!






























  • Nice! +1 for the tunable join buffer tip. If you gotta join, join in memory. Good tip!

    – Peter Dixon-Moses
    Dec 29 '15 at 20:26





































  1. CREATE TABLE that matches the CSV


  2. LOAD DATA into that table

  3. UPDATE real_table JOIN csv_table ON ... SET ..., ..., ...;

  4. DROP TABLE csv_table;


Step 3 will be a lot faster than row-by-row, but it will still lock all the rows in the table for a non-trivial amount of time. If this lock time matters more than how long the entire process takes, then...



If nothing else is writing to the table, then...





  1. CREATE TABLE that matches the CSV; no indexes except what is needed in the JOIN in the UPDATE. If unique, make it PRIMARY KEY.


  2. LOAD DATA into that table

  3. copy the real_table to new_table (CREATE ... SELECT)

  4. UPDATE new_table JOIN csv_table ON ... SET ..., ..., ...;

  5. RENAME TABLE real_table TO old, new_table TO real_table;

  6. DROP TABLE csv_table, old;


Step 3 is faster than the update, especially if unnecessary indexes are left off.

Step 5 is "instantaneous".
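
A sketch of that second recipe against the tables from the question. The staging layout assumes the CSV has been preprocessed down to the four needed columns (the question says the format can be changed in a preprocessing step), and the copy step uses CREATE TABLE ... LIKE plus INSERT ... SELECT instead of a bare CREATE ... SELECT so that the swapped-in copy keeps content's indexes:

-- 1-2: stage the CSV with only the index needed for the JOIN
CREATE TABLE csv_table (
  hash        char(40) CHARACTER SET ascii NOT NULL,
  uploaders   int unsigned NOT NULL,
  downloaders int unsigned NOT NULL,
  verified    tinyint unsigned NOT NULL,
  PRIMARY KEY (hash)
);
LOAD DATA LOCAL INFILE '/tmp/daily_dump_4cols.csv'   -- hypothetical preprocessed file
INTO TABLE csv_table
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
(hash, uploaders, downloaders, verified);

-- 3: copy the live table
CREATE TABLE new_table LIKE content;
INSERT INTO new_table SELECT * FROM content;

-- 4: apply the day's changes to the copy
UPDATE new_table a
JOIN csv_table b ON a.hash = b.hash
SET a.uploaders = b.uploaders, a.downloaders = b.downloaders, a.verified = b.verified;

-- 5: atomic swap, 6: clean up
RENAME TABLE content TO content_old, new_table TO content;
DROP TABLE csv_table, content_old;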






























  • Let's say, for example, that while we are doing step 4 (after step 3), new data gets inserted into the real_table; we will then miss that data in the new_table. What is the workaround for this? Thanks

    – AMB
    Dec 25 '15 at 3:14











  • See what pt-online-schema-change does; it takes care of such issues via a TRIGGER.

    – Rick James
    Dec 25 '15 at 8:39











  • You probably do not need any indexes on the table from LOAD DATA. Adding unnecessary indexes is costly (in time).

    – Rick James
    Dec 29 '15 at 6:40











  • Based on the latest info, I am leaning toward the CSV file being loaded into a MyISAM table with just an AUTO_INCREMENT, then chunking 1K rows at a time based on the PK. But I need to see all the requirements and the table schema before trying to spell out the details.

    – Rick James
    Dec 29 '15 at 15:53











  • I have set hash as the PRIMARY KEY, but chunking in 50k batches using an ORDER BY query takes more time. Would it be better if I create an auto-increment column and set it as the PRIMARY KEY?

    – AMB
    Dec 30 '15 at 3:11



































You've said:




  • Updates affect 6-25% of your table

  • You want to do this as fast as possible (<1hr)

  • without locking

  • it doesn't have to be in a single transaction

  • yet (in a comment on Rick James' answer) you express concern about race conditions


Many of these statements can be contradictory. For example, large updates w/o locking the table. Or avoiding race conditions w/o using one giant transaction.



Also, since your table is heavily indexed, both inserts and updates can be slow.





Avoiding race conditions



If you're able to add an updated time stamp to your table, you can solve for race conditions while also avoiding logging a half-million updates in a single transaction.



This frees you to perform line-by-line updates (as you currently do), but with autocommit or more reasonable transaction batches.



You avoid race conditions (while updating line-by-line) by performing a check that a later update has not already occurred (UPDATE ... WHERE pk = [pk] AND updated < [batchfile date])



And, importantly, this lets you run parallel updates.
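
A sketch of that guarded statement against the question's content table, assuming a hypothetical updated TIMESTAMP column has been added:

-- hypothetical: content has gained an `updated` TIMESTAMP column
-- one statement per CSV line (or per small batch), run with autocommit:
UPDATE content
SET uploaders = 123, downloaders = 456, verified = 1,
    updated = '2015-12-29 00:00:00'          -- the batch file's date
WHERE hash = 'abc123...'
  AND updated < '2015-12-29 00:00:00';       -- skip rows a newer batch already touched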





Running as fast as possible —Parallelizing



With this time stamp check now in place:




  1. Split your batch file into some reasonable sized chunks (say 50,000 rows/file)

  2. In parallel, have a script read in each file and output a file with 50,000 UPDATE statements.

  3. In parallel, once (2) finishes, have mysql run each sql file.


(e.g. in bash, look at split and xargs -P for ways to easily run a command in parallel. The degree of parallelism depends upon how many threads you're willing to dedicate to the update.)






























  • Keep in mind that "line-by-line" is likely to be 10x slower than doing things in batches of at least 100.

    – Rick James
    Dec 29 '15 at 6:37











  • You'd have to benchmark it in this case to be sure. Updating 6-25% of a table (with 8 indexes involved with the updated columns), I'd entertain the possibility that index maintenance becomes the bottleneck.

    – Peter Dixon-Moses
    Dec 29 '15 at 11:37











  • I mean, in some cases it might be faster to drop indices, bulk update, and recreate them after... but OP doesn't want downtime.

    – Peter Dixon-Moses
    Dec 29 '15 at 11:58

































For the UPDATE to run fast, you need



INDEX(uploaders, downloaders, verified)


It can be on either table. The three fields can be in any order.



That will facilitate the UPDATE being able to quickly match up the rows between the two tables.



And make the datatypes the same in the two tables (both INT SIGNED or both INT UNSIGNED).
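
For example, adding the suggested index on the staging table might look like this (a sketch; the index name is arbitrary, and either table would do per the above):

ALTER TABLE content_csv_dump_temp
  ADD INDEX upl_dl_ver (uploaders, downloaders, verified);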






























  • this actually slowed down the update.

    – AMB
    Jan 1 '16 at 3:34











  • Hmmm... Please provide EXPLAIN UPDATE ...;.

    – Rick James
    Jan 1 '16 at 4:59

































Large updates are I/O bound. I would suggest:




  1. Create a separate table that will store your 3 frequently updated fields. Let's call one table assets_static, where you keep, well, static data, and the other assets_dynamic, which will store uploaders, downloaders and verified.

  2. If you can, use the MEMORY engine for the assets_dynamic table (back it up to disk after each update).

  3. Update your lightweight and nimble assets_dynamic as per your Update 4 (i.e. LOAD DATA INFILE ... INTO temp; then UPDATE assets_dynamic a JOIN temp b ON a.id = b.id SET [what has to be updated]); see the sketch after this list. This should take less than a minute. (On our system, assets_dynamic has 95M rows and updates impact ~6M rows, in a little more than 40s.)

  4. When you run Sphinx's indexer, JOIN assets_static and assets_dynamic (assuming that you want to use one of these fields as an attribute).
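
A minimal sketch of that layout, using the column names from the question. The answer joins on id; hash is used here as the shared key because it is the question's natural key, so treat that, and the exact DDL, as assumptions:

-- the three hot columns move into a small table; everything else stays in assets_static
CREATE TABLE assets_dynamic (
  hash        char(40) CHARACTER SET ascii NOT NULL,
  uploaders   int unsigned NOT NULL DEFAULT 0,
  downloaders int unsigned NOT NULL DEFAULT 0,
  verified    tinyint unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (hash)
) ENGINE=MEMORY;   -- needs max_heap_table_size large enough; persist to disk after each batch

-- daily refresh: only the small, hot table is join-updated
UPDATE assets_dynamic a
JOIN temp b ON a.hash = b.hash
SET a.uploaders = b.uploaders, a.downloaders = b.downloaders, a.verified = b.verified;

-- feed Sphinx by joining the two halves back together
SELECT s.*, d.uploaders, d.downloaders, d.verified
FROM assets_static s
JOIN assets_dynamic d ON d.hash = s.hash;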





share








New contributor




user3127882 is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.




















    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "182"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f119621%2fhow-to-update-10-million-rows-in-mysql-single-table-as-fast-as-possible%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    6 Answers
    6






    active

    oldest

    votes








    6 Answers
    6






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    13














    Based on my experience, I would use LOAD DATA INFILE to import your CSV File.




    The LOAD DATA INFILE statement reads rows from a text file into a
    table at a very high speed.




    Example I found on the internet Load Data example. I tested this example on my box and worked fine



    Example Table



    CREATE TABLE example (
    `Id` int(11) NOT NULL AUTO_INCREMENT,
    `Column2` varchar(14) NOT NULL,
    `Column3` varchar(14) NOT NULL,
    `Column4` varchar(14) NOT NULL,
    `Column5` DATE NOT NULL,
    PRIMARY KEY (`Id`)
    ) ENGINE=InnoDB


    Example CSV File



    # more /tmp/example.csv
    Column1,Column2,Column3,Column4,Column5
    1,A,Foo,sdsdsd,4/13/2013
    2,B,Bar,sdsa,4/12/2013
    3,C,Foo,wewqe,3/12/2013
    4,D,Bar,asdsad,2/1/2013
    5,E,FOObar,wewqe,5/1/2013


    Import Statement to be run from MySQL console



    LOAD DATA LOCAL INFILE '/tmp/example.csv'
    -> INTO TABLE example
    -> FIELDS TERMINATED BY ','
    -> LINES TERMINATED BY 'n'
    -> IGNORE 1 LINES
    -> (id, Column3,Column4, @Column5)
    -> set
    -> Column5 = str_to_date(@Column5, '%m/%d/%Y');


    Result



    MySQL [testcsv]> select * from example;
    +----+---------+---------+---------+------------+
    | Id | Column2 | Column3 | Column4 | Column5 |
    +----+---------+---------+---------+------------+
    | 1 | | Column2 | Column3 | 0000-00-00 |
    | 2 | | B | Bar | 0000-00-00 |
    | 3 | | C | Foo | 0000-00-00 |
    | 4 | | D | Bar | 0000-00-00 |
    | 5 | | E | FOObar | 0000-00-00 |
    +----+---------+---------+---------+------------+


    IGNORE just simply ignores the first line which are column headers.



    After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.



    Here is another example directly from Oracle : LOAD DATA INFILE example



    This should be enough to get you started.






    share|improve this answer


























    • i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

      – AMB
      Oct 30 '15 at 13:11
















    13














    Based on my experience, I would use LOAD DATA INFILE to import your CSV File.




    The LOAD DATA INFILE statement reads rows from a text file into a
    table at a very high speed.




    Example I found on the internet Load Data example. I tested this example on my box and worked fine



    Example Table



    CREATE TABLE example (
    `Id` int(11) NOT NULL AUTO_INCREMENT,
    `Column2` varchar(14) NOT NULL,
    `Column3` varchar(14) NOT NULL,
    `Column4` varchar(14) NOT NULL,
    `Column5` DATE NOT NULL,
    PRIMARY KEY (`Id`)
    ) ENGINE=InnoDB


    Example CSV File



    # more /tmp/example.csv
    Column1,Column2,Column3,Column4,Column5
    1,A,Foo,sdsdsd,4/13/2013
    2,B,Bar,sdsa,4/12/2013
    3,C,Foo,wewqe,3/12/2013
    4,D,Bar,asdsad,2/1/2013
    5,E,FOObar,wewqe,5/1/2013


    Import Statement to be run from MySQL console



    LOAD DATA LOCAL INFILE '/tmp/example.csv'
    -> INTO TABLE example
    -> FIELDS TERMINATED BY ','
    -> LINES TERMINATED BY 'n'
    -> IGNORE 1 LINES
    -> (id, Column3,Column4, @Column5)
    -> set
    -> Column5 = str_to_date(@Column5, '%m/%d/%Y');


    Result



    MySQL [testcsv]> select * from example;
    +----+---------+---------+---------+------------+
    | Id | Column2 | Column3 | Column4 | Column5 |
    +----+---------+---------+---------+------------+
    | 1 | | Column2 | Column3 | 0000-00-00 |
    | 2 | | B | Bar | 0000-00-00 |
    | 3 | | C | Foo | 0000-00-00 |
    | 4 | | D | Bar | 0000-00-00 |
    | 5 | | E | FOObar | 0000-00-00 |
    +----+---------+---------+---------+------------+


    IGNORE just simply ignores the first line which are column headers.



    After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.



    Here is another example directly from Oracle : LOAD DATA INFILE example



    This should be enough to get you started.






    share|improve this answer


























    • i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

      – AMB
      Oct 30 '15 at 13:11














    13












    13








    13







    Based on my experience, I would use LOAD DATA INFILE to import your CSV File.




    The LOAD DATA INFILE statement reads rows from a text file into a
    table at a very high speed.




    Example I found on the internet Load Data example. I tested this example on my box and worked fine



    Example Table



    CREATE TABLE example (
    `Id` int(11) NOT NULL AUTO_INCREMENT,
    `Column2` varchar(14) NOT NULL,
    `Column3` varchar(14) NOT NULL,
    `Column4` varchar(14) NOT NULL,
    `Column5` DATE NOT NULL,
    PRIMARY KEY (`Id`)
    ) ENGINE=InnoDB


    Example CSV File



    # more /tmp/example.csv
    Column1,Column2,Column3,Column4,Column5
    1,A,Foo,sdsdsd,4/13/2013
    2,B,Bar,sdsa,4/12/2013
    3,C,Foo,wewqe,3/12/2013
    4,D,Bar,asdsad,2/1/2013
    5,E,FOObar,wewqe,5/1/2013


    Import Statement to be run from MySQL console



    LOAD DATA LOCAL INFILE '/tmp/example.csv'
    -> INTO TABLE example
    -> FIELDS TERMINATED BY ','
    -> LINES TERMINATED BY 'n'
    -> IGNORE 1 LINES
    -> (id, Column3,Column4, @Column5)
    -> set
    -> Column5 = str_to_date(@Column5, '%m/%d/%Y');


    Result



    MySQL [testcsv]> select * from example;
    +----+---------+---------+---------+------------+
    | Id | Column2 | Column3 | Column4 | Column5 |
    +----+---------+---------+---------+------------+
    | 1 | | Column2 | Column3 | 0000-00-00 |
    | 2 | | B | Bar | 0000-00-00 |
    | 3 | | C | Foo | 0000-00-00 |
    | 4 | | D | Bar | 0000-00-00 |
    | 5 | | E | FOObar | 0000-00-00 |
    +----+---------+---------+---------+------------+


    IGNORE just simply ignores the first line which are column headers.



    After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.



    Here is another example directly from Oracle : LOAD DATA INFILE example



    This should be enough to get you started.






    share|improve this answer















    Based on my experience, I would use LOAD DATA INFILE to import your CSV File.




    The LOAD DATA INFILE statement reads rows from a text file into a
    table at a very high speed.




    Example I found on the internet Load Data example. I tested this example on my box and worked fine



    Example Table



    CREATE TABLE example (
    `Id` int(11) NOT NULL AUTO_INCREMENT,
    `Column2` varchar(14) NOT NULL,
    `Column3` varchar(14) NOT NULL,
    `Column4` varchar(14) NOT NULL,
    `Column5` DATE NOT NULL,
    PRIMARY KEY (`Id`)
    ) ENGINE=InnoDB


    Example CSV File



    # more /tmp/example.csv
    Column1,Column2,Column3,Column4,Column5
    1,A,Foo,sdsdsd,4/13/2013
    2,B,Bar,sdsa,4/12/2013
    3,C,Foo,wewqe,3/12/2013
    4,D,Bar,asdsad,2/1/2013
    5,E,FOObar,wewqe,5/1/2013


    Import Statement to be run from MySQL console



    LOAD DATA LOCAL INFILE '/tmp/example.csv'
    -> INTO TABLE example
    -> FIELDS TERMINATED BY ','
    -> LINES TERMINATED BY 'n'
    -> IGNORE 1 LINES
    -> (id, Column3,Column4, @Column5)
    -> set
    -> Column5 = str_to_date(@Column5, '%m/%d/%Y');


    Result



    MySQL [testcsv]> select * from example;
    +----+---------+---------+---------+------------+
    | Id | Column2 | Column3 | Column4 | Column5 |
    +----+---------+---------+---------+------------+
    | 1 | | Column2 | Column3 | 0000-00-00 |
    | 2 | | B | Bar | 0000-00-00 |
    | 3 | | C | Foo | 0000-00-00 |
    | 4 | | D | Bar | 0000-00-00 |
    | 5 | | E | FOObar | 0000-00-00 |
    +----+---------+---------+---------+------------+


    IGNORE just simply ignores the first line which are column headers.



    After IGNORE, we are specifying the columns (skipping column2), to import, which matches one of the criteria in your question.



    Here is another example directly from Oracle : LOAD DATA INFILE example



    This should be enough to get you started.







    share|improve this answer














    share|improve this answer



    share|improve this answer








    edited Oct 30 '15 at 12:59

























    answered Oct 30 '15 at 12:45









    Craig EfreinCraig Efrein

    7,19784078




    7,19784078













    • i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

      – AMB
      Oct 30 '15 at 13:11



















    • i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

      – AMB
      Oct 30 '15 at 13:11

















    i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

    – AMB
    Oct 30 '15 at 13:11





    i could use load data for loading data in temp table and then use other queries to update it in main table., thanks

    – AMB
    Oct 30 '15 at 13:11













    12





    +100









    In light of all the things mentioned, it looks like the bottleneck is the join itself.



    ASPECT #1 : Join Buffer Size



    In all likelihood, your join_buffer_size is probably too low.



    According to the MySQL Documentation on How MySQL Uses the Join Buffer Cache




    We only store the used columns in the join buffer, not the whole rows.




    This being the case, make the keys of the join buffer stay in RAM.



    You have 10 million rows times 4 bytes for each key. That's about 40M.



    Try bumping it up in the session to 42M (a little bigger than 40M)



    SET join_buffer_size = 1024 * 1024 * 42;
    UPDATE table1 a JOIN table2 b
    ON a.field1 = b.field1
    SET
    a.field2 = b.field2,
    a.field3 = b.field3,
    a.field4 = b.field4;


    If this does the trick, proceed to add this to my.cnf



    [mysqld]
    join_buffer_size = 42M


    Restarting mysqld not required for new connections. Just run



    mysql> SET GLOBAL join_buffer_size = 1024 * 1024 * 42;


    ASPECT #2 : Join Operation



    You could manipulate the style of the join operation by tweeking the optimizer



    According to MySQL Documentation on Block Nested-Loop and Batched Key Access Joins




    When BKA is used, the value of join_buffer_size defines how large the batch of keys is in each request to the storage engine. The larger the buffer, the more sequential access will be to the right hand table of a join operation, which can significantly improve performance.



    For BKA to be used, the batched_key_access flag of the optimizer_switch system variable must be set to on. BKA uses MRR, so the mrr flag must also be on. Currently, the cost estimation for MRR is too pessimistic. Hence, it is also necessary for mrr_cost_based to be off for BKA to be used.




    This same page recommends doing this:



    mysql> SET optimizer_switch='mrr=on,mrr_cost_based=off,batched_key_access=on';


    ASPECT #3 : Writing Updates to Disk (OPTIONAL)



    Most forget to increase the innodb_write_io_threads to write dirty pages out of the buffer pool faster.



    [mysqld]
    innodb_write_io_threads = 16


    You will have to restart MySQL for this change



    GIVE IT A TRY !!!






    answered Dec 28 '15 at 21:56 by RolandoMySQLDBA

    • Nice! +1 for the tunable join buffer tip. If you gotta join, join in memory. Good tip!
      – Peter Dixon-Moses
      Dec 29 '15 at 20:26

    3
















    1. CREATE TABLE that matches the CSV


    2. LOAD DATA into that table

    3. UPDATE real_table JOIN csv_table ON ... SET ..., ..., ...;

    4. DROP TABLE csv_table;


    Step 3 will be a lot faster than row-by-row, but it will still lock all the rows in the table for a non-trivial amount of time. If that lock time matters more than how long the entire process takes, then consider the alternative below.



    If nothing else is writing to the table, then...





    1. CREATE TABLE that matches the CSV; no indexes except what is needed in the JOIN in the UPDATE. If unique, make it PRIMARY KEY.


    2. LOAD DATA into that table

    3. copy the real_table to new_table (CREATE ... SELECT)

    4. UPDATE new_table JOIN csv_table ON ... SET ..., ..., ...;

    5. RENAME TABLE real_table TO old, new_table TO real_table;

    6. DROP TABLE csv_table, old;


    Step 3 is faster than the update, especially if unnecessary indexes are left off.

    Step 5 is "instantaneous".






    answered Dec 6 '15 at 7:49 by Rick James

    • lets say in seconds example, after step 3, we are doing step 4, then the new data gets inserted in the real_table so we will miss that data in the new_table ? what is workaround for this ? thanks
      – AMB
      Dec 25 '15 at 3:14

    • See what pt-online-schema-change does; it takes care of such issues via a TRIGGER.
      – Rick James
      Dec 25 '15 at 8:39

    • You probably do not need any indexes on the table from LOAD DATA. Adding unnecessary indexes is costly (in time).
      – Rick James
      Dec 29 '15 at 6:40

    • Based on the latest info, I am leaning toward the CSV file being loaded into a MyISAM table with just an AUTO_INCREMENT, then chunking 1K rows at a time based on the PK. But I need to see all the requirements and the table schema before trying to spell out the details.
      – Rick James
      Dec 29 '15 at 15:53

    • i have set hash as PRIMARY index, but while chunking in 50k using order query takes more time., would it be better if i create auto increment ? and set it as PRIMARY index ?
      – AMB
      Dec 30 '15 at 3:11













    2














    You've said:




    • Updates affect 6-25% of your table

    • You want to do this as fast as possible (<1hr)

    • without locking

    • it doesn't have to be in a single transaction

    • yet (in a comment on Rick James's answer), you express concern about race conditions


    Several of these requirements pull against each other: for example, large updates without locking the table, or avoiding race conditions without using one giant transaction.



    Also, since your table is heavily indexed, both inserts and updates can be slow.





    Avoiding race conditions



    If you're able to add an updated time stamp to your table, you can solve for race conditions while also avoiding logging a half-million updates in a single transaction.



    This frees you to perform line-by-line updates (as you currently do), but with autocommit or more reasonable transaction batches.



    You avoid race conditions (while updating line-by-line) by performing a check that a later update has not already occurred (UPDATE ... WHERE pk = [pk] AND updated < [batchfile date])
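    A hedged, concrete version of that guard (hash and last_updated exist in the content schema shown in the question; the literal values below stand in for one line of the batch file, and stamping last_updated with the batch date is an assumption about how that column is maintained):

    UPDATE content
       SET downloaders  = 123,                      -- values from this CSV line (placeholders)
           completed    = 456,
           uploaders    = 789,
           last_updated = '2015-12-28 00:00:00'     -- assumed: stamp with the batch-file date
     WHERE hash = '...'                             -- placeholder key value
       AND last_updated < '2015-12-28 00:00:00';    -- skip rows already touched since the dump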



    And, importantly, this lets you run parallel updates.





    Running as fast as possible: Parallelizing



    With this time stamp check now in place:




    1. Split your batch file into reasonably sized chunks (say 50,000 rows per file)

    2. In parallel, have a script read in each file and output a file with 50,000 UPDATE statements.

    3. In parallel, once (2) finishes, have mysql run each sql file.


    (e.g. in bash, look at split and xargs -P for easy ways to run a command across many inputs in parallel. The degree of parallelism depends on how many threads you're willing to dedicate to the update.)
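    For instance, each generated chunk could be a plain SQL file of the form below, which the mysql client then executes (the file name, chunk size and batch date are illustrative only; the guarded UPDATE is the same pattern as above):

    -- chunk_0001.sql (hypothetical name)
    START TRANSACTION;
    UPDATE content SET downloaders = 10, completed = 5, uploaders = 2,
           last_updated = '2015-12-28 00:00:00'
     WHERE hash = '...' AND last_updated < '2015-12-28 00:00:00';
    -- ... one such UPDATE per CSV line in this chunk ...
    COMMIT;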






    answered Dec 28 '15 at 21:07 by Peter Dixon-Moses

    • Keep in mind that "line-by-line" is likely to be 10x slower than doing things in batches of at least 100.
      – Rick James
      Dec 29 '15 at 6:37

    • You'd have to benchmark it in this case to be sure. Updating 6-25% of a table (with 8 indexes involved with the updated columns), I'd entertain the possibility that index maintenance becomes the bottleneck.
      – Peter Dixon-Moses
      Dec 29 '15 at 11:37

    • I mean, in some cases it might be faster to drop indices, bulk update, and recreate them after... but OP doesn't want downtime.
      – Peter Dixon-Moses
      Dec 29 '15 at 11:58











    0














    For the UPDATE to run fast, you need



    INDEX(uploaders, downloaders, verified)


    It can be on either table. The three fields can be in any order.



    That will let the UPDATE quickly match up the rows between the two tables.



    And make the datatypes the same in the two tables (both INT SIGNED or both INT UNSIGNED).
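    A hedged sketch of both suggestions, assuming the table loaded from the CSV is called csv_table (verified's exact type is not shown in the question, so match it to whatever content uses):

    ALTER TABLE csv_table
        ADD INDEX idx_upl_dl_ver (uploaders, downloaders, verified);

    -- keep the compared columns identically typed on both tables, e.g.
    ALTER TABLE csv_table
        MODIFY uploaders   int NOT NULL,
        MODIFY downloaders int NOT NULL;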






    answered Dec 30 '15 at 16:03 by Rick James

    • this actually slowed down the update.
      – AMB
      Jan 1 '16 at 3:34

    • Hmmm... Please provide EXPLAIN UPDATE ...;.
      – Rick James
      Jan 1 '16 at 4:59











    0














    Large updates are I/O bound. I would suggest:




    1. Create a distinct table that will store your 3 frequently updated fields. Let's call one table assets_static where you keep, well, static data, and the other assets_dynamic that will store uploaders, downloaders and verified.

    2. If you can, use the MEMORY engine for the assets_dynamic table (back it up to disk after each update).

    3. Update your lightweight and nimble assets_dynamic as per your update 4 (i.e. LOAD INFILE ... INTO temp; UPDATE assets_dynamic a JOIN temp b ON a.id = b.id SET [what has to be updated]). This should take less than a minute. (On our system, assets_dynamic has 95M rows and updates impact ~6M rows, in a little more than 40s.) See the sketch after this list.

    4. When you run Sphinx's indexer, JOIN assets_static and assets_dynamic (assuming that you want to use one of these fields as an attribute).
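    A hedged sketch of the split and the daily refresh (column names and types are assumptions, and the id join key follows this answer; the question's table keys on hash, so adapt as needed):

    CREATE TABLE assets_dynamic (
        id          int unsigned NOT NULL PRIMARY KEY,
        uploaders   int NOT NULL DEFAULT 0,
        downloaders int NOT NULL DEFAULT 0,
        verified    tinyint NOT NULL DEFAULT 0
    ) ENGINE = MEMORY;           -- per step 2; back it up to disk after each refresh

    -- step 3: load the CSV into a scratch table, then join-update the hot table
    CREATE TABLE temp LIKE assets_dynamic;
    ALTER TABLE temp ENGINE = InnoDB;             -- keep the scratch copy on disk

    LOAD DATA LOCAL INFILE '/path/to/dump.csv'    -- placeholder path/format
        INTO TABLE temp
        FIELDS TERMINATED BY ',';

    UPDATE assets_dynamic a JOIN temp b ON a.id = b.id
       SET a.uploaders   = b.uploaders,
           a.downloaders = b.downloaders,
           a.verified    = b.verified;

    DROP TABLE temp;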





    answered 5 mins ago by user3127882 (new contributor)





























