Split String at First Occurrence of an Integer using R

Note: I have already read "Split string at first occurrence of an integer in a string"; however, my request is different because I would like to use R.

Suppose I have the following example data frame:

> df = data.frame(name_and_address =
      c("Mr. Smith12 Some street",
        "Mr. Jones345 Another street",
        "Mr. Anderson6 A different street"))
> df
                  name_and_address
1          Mr. Smith12 Some street
2      Mr. Jones345 Another street
3 Mr. Anderson6 A different street

I would like to split the string at the first occurrence of an integer. Notice that the integers are of varying length.

The desired output would look like the following:

[[1]]
[1] "Mr. Smith"
[2] "12 Some street"

[[2]]
[1] "Mr. Jones"
[2] "345 Another street"

[[3]]
[1] "Mr. Anderson"
[2] "6 A different street"

I have tried the following, but I cannot get the regular expression right:

# Attempt 1 (Does not work)
library(data.table)
tstrsplit(df,'(?=\\d+)', perl=TRUE, type.convert=TRUE)

# Attempt 2 (Does not work)
library(stringr)
str_split(fha_ltc, "\\d+")

Solution:

You can use tidyr::extract:

library(tidyr)
df <- df %>% 
    extract("name_and_address", c("name", "address"), "(\\D*)(\\d.*)")
## => df
##           name              address
## 1    Mr. Smith       12 Some street
## 2    Mr. Jones   345 Another street
## 3 Mr. Anderson 6 A different street

The (\D*)(\d.*) regex matches the following:

  • (\D*) – Group 1: zero or more non-digit characters
  • (\d.*) – Group 2: a digit followed by zero or more characters, as many as possible (that is, the rest of the string).
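
For readers who prefer not to depend on tidyr, the same two capture groups can be pulled out with base R. This is only a minimal sketch: the vector name x, the list name parts, and the added ^ and $ anchors are assumptions for illustration and are not part of the answer above.

```r
# Base R sketch: capture both groups with regexec() and regmatches().
x <- c("Mr. Smith12 Some street",
       "Mr. Jones345 Another street",
       "Mr. Anderson6 A different street")

# Same pattern as above; the ^ and $ anchors are an added assumption so that
# the whole string has to match.
m <- regmatches(x, regexec("^(\\D*)(\\d.*)$", x, perl = TRUE))

# Each element of m is c(full match, group 1, group 2); drop the full match.
parts <- lapply(m, function(v) v[-1])

parts[[1]]
# [1] "Mr. Smith"      "12 Some street"
```

A string that contains no digit does not match the pattern, so its entry in parts is an empty character vector.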

Another solution with stringr::str_split is also possible:

library(stringr)
str_split(df$name_and_address, "(?=\\d)", n = 2)
## => [[1]]
## [1] "Mr. Smith"      "12 Some street"

## [[2]]
## [1] "Mr. Jones"          "345 Another street"

## [[3]]
## [1] "Mr. Anderson"         "6 A different street"

The (?=\d) positive lookahead matches the position immediately before a digit without consuming it, and n=2 tells stringr::str_split to return at most 2 pieces, so only the first such position is used.
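
The position that the lookahead identifies can also be found directly in base R with regexpr() and then used to cut the string with substr(). The following is a rough sketch rather than a drop-in replacement: the names x, pos, has_digit, name, and address are hypothetical, and returning NA for the address when a string has no digit is an assumed convention.

```r
# Base R sketch: find the first digit, then cut the string at that position.
x <- c("Mr. Smith12 Some street",
       "Mr. Jones345 Another street",
       "Mr. Anderson6 A different street")

# Index of the first digit in each string (-1 when there is none).
# as.integer() drops the attributes that regexpr() attaches to its result.
pos <- as.integer(regexpr("\\d", x, perl = TRUE))
has_digit <- pos > 0

# Everything before the first digit, and everything from the digit onward.
name    <- ifelse(has_digit, substr(x, 1, pos - 1), x)
address <- ifelse(has_digit, substr(x, pos, nchar(x)), NA_character_)

data.frame(name = name, address = address)
```

Because the cut happens at a single computed index, there is no need for an n argument: only the first digit is ever considered.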
