Systems Engineering and RDBMS

Separating numbers and text in a string column

Posted by decipherinfosys on June 11, 2009

Ran into an issue at a client site last week where they were getting data from a legacy system and the legacy system had concatenated data from three fields – two were number fields which really should have been one single field – they were entering a number and then when the order was processed, entered another number which got concatenated to whatever was in that field.  The other was a string field.  The result was that a single string field was being sent over without any fixed pattern to it – there were numbers and text intermixed.  The aim was to separate them out in a properly normalized fashion.  Here is an example of the way the data was being sent:

SET NOCOUNT ON
GO
declare @tblmixed table (ID int identity, col1 nvarchar(100));
insert into @tblmixed values (N'96578Whitelabel')
insert into @tblmixed values (N'101Bluelabel567890')
insert into @tblmixed values (N'Red15label46890')

Aim was to separate them out so that we had one number field (in the order that the numbers appeared in the text field) and the other as a string.  We will make use of the number table technique that was introduced by Microsoft MVP Itzik and that we have also covered in our posts before – here and here.   So, here goes:

with
N1 (n) AS (SELECT 1 UNION ALL SELECT 1),
N2 (n) AS (SELECT 1 FROM N1 AS X, N1 AS Y),
N3 (n) AS (SELECT 1 FROM N2 AS X, N2 AS Y),
N4 (n) AS (SELECT 1 FROM N3 AS X, N3 AS Y),
N5 (n) AS (SELECT ROW_NUMBER() OVER(ORDER BY X.n%10) FROM N4 AS X, N4 AS Y)
select        n
,   p.ID   as ID
,    p.Col1 as Col1
,case
when substring(p.Col1, n, 1) like '[0-9]'
then substring(p.Col1, n, 1)
else ''
end as Col1_Nbr
,case
when substring(p.Col1, n, 1) not like '[0-9]'
then substring(p.Col1, n, 1)
else ''
end as Col1_Str
from N5
join @tblmixed as p
on N5.n <= Len(p.col1)
order by p.id, n

The output of this will be:

n                    ID          Col1                                                                                                 Col1_Nbr Col1_Str
-------------------- ----------- ---------------------------------------------------------------------------------------------------- -------- --------
1                    1           96578Whitelabel                                                                                      9
2                    1           96578Whitelabel                                                                                      6
3                    1           96578Whitelabel                                                                                      5
4                    1           96578Whitelabel                                                                                      7
5                    1           96578Whitelabel                                                                                      8
6                    1           96578Whitelabel                                                                                               W
7                    1           96578Whitelabel                                                                                               h
8                    1           96578Whitelabel                                                                                               i
9                    1           96578Whitelabel                                                                                               t
10                   1           96578Whitelabel                                                                                               e
11                   1           96578Whitelabel                                                                                               l
12                   1           96578Whitelabel                                                                                               a
13                   1           96578Whitelabel                                                                                               b
14                   1           96578Whitelabel                                                                                               e
15                   1           96578Whitelabel                                                                                               l
1                    2           101Bluelabel567890                                                                                   1
2                    2           101Bluelabel567890                                                                                   0
3                    2           101Bluelabel567890                                                                                   1
4                    2           101Bluelabel567890                                                                                            B
5                    2           101Bluelabel567890                                                                                            l
6                    2           101Bluelabel567890                                                                                            u
7                    2           101Bluelabel567890                                                                                            e
8                    2           101Bluelabel567890                                                                                            l
9                    2           101Bluelabel567890                                                                                            a
10                   2           101Bluelabel567890                                                                                            b
11                   2           101Bluelabel567890                                                                                            e
12                   2           101Bluelabel567890                                                                                            l
13                   2           101Bluelabel567890                                                                                   5
14                   2           101Bluelabel567890                                                                                   6
15                   2           101Bluelabel567890                                                                                   7
16                   2           101Bluelabel567890                                                                                   8
17                   2           101Bluelabel567890                                                                                   9
18                   2           101Bluelabel567890                                                                                   0
1                    3           Red15label46890                                                                                               R
2                    3           Red15label46890                                                                                               e
3                    3           Red15label46890                                                                                               d
4                    3           Red15label46890                                                                                      1
5                    3           Red15label46890                                                                                      5
6                    3           Red15label46890                                                                                               l
7                    3           Red15label46890                                                                                               a
8                    3           Red15label46890                                                                                               b
9                    3           Red15label46890                                                                                               e
10                   3           Red15label46890                                                                                               l
11                   3           Red15label46890                                                                                      4
12                   3           Red15label46890                                                                                      6
13                   3           Red15label46890                                                                                      8
14                   3           Red15label46890                                                                                      9
15                   3           Red15label46890                                                                                      0

Now, you can see above that we have actually stripped the string position by position and have divided it up into 2 columns – one that contains the numbers and the otherr that contains the strings – all that remains to do now, is roll  it up so that per ID value, we can then have 2 columns – one for the numbers and the other for the strings.  In order to achieve that, we will use the same XML PATH (”) technique that we have used in the past and is a very popular technique to concatenate rows of column data into a single value:

with
N1 (n) AS (SELECT 1 UNION ALL SELECT 1),
N2 (n) AS (SELECT 1 FROM N1 AS X, N1 AS Y),
N3 (n) AS (SELECT 1 FROM N2 AS X, N2 AS Y),
N4 (n) AS (SELECT 1 FROM N3 AS X, N3 AS Y),
N5 (n) AS (SELECT ROW_NUMBER() OVER(ORDER BY X.n%10) FROM N4 AS X, N4 AS Y),
NET as (
select top 100 percent
n
,    p.ID   as ID
,    p.Col1 as Col1
,    case
when substring(p.Col1, n, 1) like '[0-9]'
then substring(p.Col1, n, 1)
else ''
end as Col1_Nbr
,case
when substring(p.Col1, n, 1) not like '[0-9]'
then substring(p.Col1, n, 1)
else ''
end as Col1_Str
from N5
join @tblmixed as p
on N5.n <= Len(p.col1)
where n <= 100
order by p.id, n)
SELECT DISTINCT
ID
,    COL1
, (select '' + CAST(InnerNet.Col1_Nbr AS NVARCHAR(100)) AS [text()]
from Net as InnerNet
Where InnerNet.ID = OuterNet.ID
order by InnerNet.n
for xml path('')) AS COL_NBR
, (select '' + CAST(InnerNet.Col1_Str AS NVARCHAR(100)) AS [text()]
from Net as InnerNet
Where InnerNet.ID = OuterNet.ID
order by InnerNet.n
for xml path('')) AS COL_STR
FROM NET as OuterNet

We have simply used the result and then the XML PATH (”) technique to break the output into two separate columns (one for COL_NBR and the other one for COL_STR) for every string value that we had.  The output is this:

ID          COL1               COL_NBR         COL_STR
------------------------------ ---------       -------
1           96578Whitelabel    96578           Whitelabel
2           101Bluelabel567890 101567890       Bluelabel
3           Red15label46890    1546890         Redlabel

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

 
%d bloggers like this: