Extract numbers from string using T-SQL

Rajnilari2015
Posted by in Sql Server category on for Intermediate level | Points: 250 | Views : 1129 red flag

In many high end programming languages like C#,Java,JavaScript etc. they provide a feature call as regular expression to do this kind of operation and it is quite handy also for this kind of job. However, SQL Server is doesn't provide such an option to do so. But with a little of effort and trick we can obtain that. This article is mainly focus on how to do so.


 Download source code for Extract numbers from string using T-SQL

Introduction

Sometimes we encounter the situation where we need to extract the numbers from the set of text. E.g. suppose we have a text like

My first mobile number:91161181100. Second mobile number:- 1111111111. Third one is: 1212121212

And we need to extract the mobile numbers in a way so the the final output must be

91161181100,1111111111,1212121212

In many high end programming languages like C#,Java,JavaScript etc. they provide a feature call as regular expression to do this kind of operation and it is quite handy also for this kind of job.However, SQL Server is doesn't provide such an option to do so.But with a little of effort and trick we can obtain that.This article is mainly focus on how to do so.

Environment Setup

So let us setup the environment first

DECLARE @T TABLE(Name VARCHAR(50),Description VARCHAR(500))
INSERT INTO @T 
	SELECT 'Name1','My first mobile number:91161181100. Second mobile number:- 1111111111. Third one is: 1212121212' UNION ALL
	SELECT 'Name2','Hello friends!!!.I have four mobile phones.The numbers are 234567890,123456789,456789034,098765431' UNION ALL
	SELECT 'Name3','My number is123456789.I have two more:23456789 and567891234numbers'

SELECT *
FROM @T

Yields

A careful observation can help us to look into the way the texts are arranged.In the first row, the numbers are arranged in three different locations.In the second case they are located next to each other but are separated by commas while in the last one , the numbers and the alphabets are aligned to each other.In another word, it is highly unstructured data.However, we need to process it to figure out the respective and needed numbers and the presentation should be as under

Using the code

In order to solve this problem,first we will split the textual data with space(' ') as the delimiter as shown under

DECLARE @PATTERN_TO_FIND  VARCHAR(20) = '%[0-9]%'
;WITH SplitCTE AS(
	SELECT 
		X.Name, 
		Y.SplitData 
		,Position = PATINDEX(@PATTERN_TO_FIND, Y.SplitData)
	FROM (
	SELECT 
	*,	
	CAST('<X>'+REPLACE(REPLACE(Description,' ',','),',','</X><X>')+'</X>' AS XML) AS ColXML 
	FROM @T 
	)X CROSS APPLY ( 
	SELECT data.d.value('.','varchar(50)') AS SplitData 
	FROM X.ColXML.nodes('X') AS data(d)) Y)
 SELECT *
 FROM SplitCTE

The output is as expected

Name	SplitData		Position
Name1	My			0
Name1	first			0
Name1	mobile			0
Name1	number:91161181100.	8
Name1	Second			0
Name1	mobile			0
Name1	number:-		0
Name1	1111111111.		1
Name1	Third			0
Name1	one			0
Name1	is:			0
Name1	1212121212		1
Name2	Hello			0
Name2	friends!!!.I		0
Name2	have			0
Name2	four			0
Name2	mobile			0
Name2	phones.The		0
Name2	numbers			0
Name2	are			0
Name2	234567890,123456789,456789034,098765431	1
Name3	My			0
Name3	number			0
Name3	is123456789.I		3
Name3	have			0
Name3	two			0
Name3	more:23456789		6
Name3	and567891234numbers	4

As can be figure out that by using the PATINDEX, we can easily be able to detect out which texts are having alphanumeric characters and which are pure alphabets. So, as a second step, we will eliminate those which does have only alphabets as shown under

DECLARE @PATTERN_TO_FIND  VARCHAR(20) = '%[0-9]%'
;WITH SplitCTE AS(
	SELECT 
		X.Name, 
		Y.SplitData 
		,Position = PATINDEX(@PATTERN_TO_FIND, Y.SplitData)
	FROM (
			SELECT 
				*,	
	CAST('<X>'+REPLACE(REPLACE(Description,' ',','),',','</X><X>')+'</X>' AS XML) AS ColXML  
			FROM @T 
		)X CROSS APPLY ( 
		SELECT data.d.value('.','varchar(50)') AS SplitData 
		FROM X.ColXML.nodes('X') AS data(d)) Y
	WHERE PATINDEX(@PATTERN_TO_FIND, Y.SplitData) > 0)
 SELECT *
 FROM SplitCTE
 

This gives us the result as

The next task is to find out the numeric values which can be done by using the following SQL Code snippet

OnlyNumericCTE AS(
    SELECT 
        Name
,AllNumeric = LEFT(SUBSTRING(SplitData, Position, LEN(SplitData)), PATINDEX('%[^0-9]%', SUBSTRING(SplitData, Position, LEN(SplitData)) + 't') - 1)      
    FROM SplitCTE )

The output will be

So the query at this stage is

DECLARE @T TABLE(Name VARCHAR(50),Description VARCHAR(500))
INSERT INTO @T 
	SELECT 'Name1','My first mobile number:91161181100. Second mobile number:- 1111111111. Third one is: 1212121212' UNION ALL
	SELECT 'Name2','Hello friends!!!.I have four mobile phones.The numbers are 234567890,123456789,456789034,098765431' UNION ALL
	SELECT 'Name3','My number is123456789.I have two more:23456789 and567891234numbers'

DECLARE @PATTERN_TO_FIND  VARCHAR(20) = '%[0-9]%'
;WITH SplitCTE AS(
	SELECT 
		X.Name, 
		Y.SplitData 
		,Position = PATINDEX(@PATTERN_TO_FIND, Y.SplitData)		
	FROM (
			SELECT 
				*,					
CAST('<X>'+REPLACE(REPLACE(Description,' ',','),',','</X><X>')+'</X>' AS XML) AS ColXML 
			FROM @T 
		)X CROSS APPLY ( 
		SELECT data.d.value('.','varchar(50)') AS SplitData 
		FROM X.ColXML.nodes('X') AS data(d)) Y
		WHERE PATINDEX(@PATTERN_TO_FIND, Y.SplitData) > 0),
OnlyNumericCTE AS(
    SELECT 
        Name
,AllNumeric = LEFT(SUBSTRING(SplitData, Position, LEN(SplitData)), PATINDEX('%[^0-9]%', SUBSTRING(SplitData, Position, LEN(SplitData)) + 't') - 1)      
    FROM SplitCTE )
SELECT *
FROM OnlyNumericCTE
 

So now we are very close to our solution.As a final step we need to the FORXMLPATH and group by the items by using the 'Name' column

SELECT
		Name
		,STUFF(( SELECT ',' + c1.AllNumeric
			 FROM OnlyNumericCTE c1
			 WHERE c1.Name = c2.Name
			 FOR XML PATH(''),TYPE)
			 .value('.','NVARCHAR(MAX)'),1,1,'') AS MobileNumbers
FROM OnlyNumericCTE c2
GROUP BY Name

So the complete query at this stage is

DECLARE @T TABLE(Name VARCHAR(50),Description VARCHAR(500))
INSERT INTO @T 
	SELECT 'Name1','My first mobile number:91161181100. Second mobile number:- 1111111111. Third one is: 1212121212' UNION ALL
	SELECT 'Name2','Hello friends!!!.I have four mobile phones.The numbers are 234567890,123456789,456789034,098765431' UNION ALL
	SELECT 'Name3','My number is123456789.I have two more:23456789 and567891234numbers'

DECLARE @PATTERN_TO_FIND  VARCHAR(20) = '%[0-9]%'
;WITH SplitCTE AS(
	SELECT 
		X.Name, 
		Y.SplitData 
		,Position = PATINDEX(@PATTERN_TO_FIND, Y.SplitData)		
	FROM (
			SELECT 
				*,					
CAST('<X>'+REPLACE(REPLACE(Description,' ',','),',','</X><X>')+'</X>' AS XML) AS ColXML 
			FROM @T 
		)X CROSS APPLY ( 
		SELECT data.d.value('.','varchar(50)') AS SplitData 
		FROM X.ColXML.nodes('X') AS data(d)) Y
		WHERE PATINDEX(@PATTERN_TO_FIND, Y.SplitData) > 0),
OnlyNumericCTE AS(
    SELECT 
        Name
        ,AllNumeric = LEFT(SUBSTRING(SplitData, Position, LEN(SplitData)), PATINDEX('%[^0-9]%', SUBSTRING(SplitData, Position, LEN(SplitData)) + 't') - 1)      
    FROM SplitCTE )

SELECT
		Name
		,STUFF(( SELECT ',' + c1.AllNumeric
			 FROM OnlyNumericCTE c1
			 WHERE c1.Name = c2.Name
			 FOR XML PATH(''),TYPE)
			 .value('.','NVARCHAR(MAX)'),1,1,'') AS MobileNumbers
FROM OnlyNumericCTE c2
GROUP BY Name

The entire query can be reduced to

DECLARE @PATTERN_TO_FIND  VARCHAR(20) = '%[0-9]%'
;WITH onlyNumericCTE AS(
	SELECT 
		X.Name		
		,AllNumeric = LEFT(SUBSTRING(Y.SplitData, PATINDEX(@PATTERN_TO_FIND, Y.SplitData), LEN(Y.SplitData)), PATINDEX('%[^0-9]%', SUBSTRING(Y.SplitData, PATINDEX(@PATTERN_TO_FIND, Y.SplitData), LEN(Y.SplitData)) + 't') - 1) 	
	FROM (
			SELECT 
				*,					
CAST('<X>'+REPLACE(REPLACE(Description,' ',','),',','</X><X>')+'</X>' AS XML) AS ColXML 
			FROM @T 
		)X CROSS APPLY ( 
		SELECT data.d.value('.','varchar(50)') AS SplitData 
		FROM X.ColXML.nodes('X') AS data(d)) Y
		WHERE PATINDEX(@PATTERN_TO_FIND, Y.SplitData) > 0)
SELECT
		Name
		,STUFF(( SELECT ',' + c1.AllNumeric
			 FROM onlyNumericCTE c1
			 WHERE c1.Name = c2.Name
			 FOR XML PATH(''),TYPE)
			 .value('.','NVARCHAR(MAX)'),1,1,'') AS MobileNumbers
FROM onlyNumericCTE c2
GROUP BY Name

Reference

PATINDEX

Using APPLY

Conclusion

Though we have seen the way to extracting numeric values from textual data, however, we will not recommend T-SQL for doing this kind of operation as huge/heavy string processing is not the role for which Sql-Server is meant for.Better to use the regular expression technique of the programming languages like C#,Java,JavaScript etc.Thanks for reading.Zipped file is attached herewith.

Page copy protected against web site content infringement by Copyscape

About the Author

Rajnilari2015
Full Name: Niladri Biswas (RNA Team)
Member Level: Platinum
Member Status: Member,Microsoft_MVP,MVP
Member Since: 3/17/2015 2:41:06 AM
Country: India
-- Thanks & Regards, RNA Team


Login to vote for this post.

Comments or Responses

Login to post response

Comment using Facebook(Author doesn't get notification)