Java Regular Expressions

Posted on 2015-10-05, filed under Java

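Summary of commonly used regular expressions

([a-zA-Z -]*)([0-9]+) # Matches with capture groups. Example matches: "abc90" and "90". Group 1 is the text part (empty or "abc"), group 2 is the number ("90").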

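Regular expressions

Definition

A regular expression defines a pattern for strings.
Regular expressions can be used to search, edit, or otherwise process text.
Regular expressions are not tied to any one language, but each language's flavor differs in small ways.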

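Escaping

In Java source code, \\ means: insert one regular-expression backslash, so that the character following it takes on a special meaning.
In other languages (such as Perl), a single backslash \ is enough to escape. In Java, the regex is written inside a string literal, where the backslash is itself an escape character, so two backslashes are needed to produce the one backslash the regex engine sees.
In other words, \\ in a Java regex string stands for a single \ in other languages. That is why the regex for one digit is written \\d, and the regex for one literal backslash is written \\\\.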
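The escaping rule can be checked with a short sketch. The class and method names (EscapeDemo, isSingleDigit, isSingleBackslash) are made up for illustration; only Pattern.matches comes from the JDK.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // The regex \d (one digit) has to be written "\\d" in Java source.
    static boolean isSingleDigit(String s) {
        return Pattern.matches("\\d", s);
    }

    // The regex \\ (one literal backslash) has to be written "\\\\" in Java source.
    static boolean isSingleBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    public static void main(String[] args) {
        System.out.println(isSingleDigit("7"));      // true
        System.out.println(isSingleBackslash("\\")); // true: the literal "\\" is a one-character string
        System.out.println("\\\\".length());         // 2: the regex text itself is two backslashes
    }
}
```

The last line shows what the regex engine actually receives: the four backslashes in the source collapse to two characters, which the engine reads as one escaped backslash.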

Common metacharacters

^, $, *, +, ?, [a-z], \w, \W, {n,m}

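Common rules

A: Characters
    x The character x. Example: 'a' denotes the character a.
    \\ The backslash character. It takes two backslashes to denote one, because the backslash is the escape character.
    \n The newline (line feed) character ('\u000A')
    \r The carriage-return character ('\u000D')
B: Character classes
    [abc] a, b, or c (simple class)
    [^abc] Any character except a, b, or c (negation)
    [a-zA-Z] a through z or A through Z, inclusive (range)
    [0-9] Any character from 0 through 9
C: Predefined character classes
    . Any character (it matches line terminators only when the DOTALL flag is set). To match a literal '.', write \. (that is, "\\." in a Java string).
    \d A digit: [0-9]
    \w A word character: [a-zA-Z_0-9]
D: Boundary matchers
    ^ The beginning of a line
    $ The end of a line
    \b A word boundary
        A word boundary is a position between a word character and a non-word character, or between a word character and the start or end of the input.
        Example: in hello world?haha;xixi the three non-word characters (the space, ?, and ;) separate four words, and \b matches at the start and end of each word.
E: Greedy quantifiers
    X? X, once or not at all
    X* X, zero or more times
    X+ X, one or more times
    X{n} X, exactly n times
    X{n,} X, at least n times
    X{n,m} X, at least n but not more than m times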
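A few of the rules above, tried out in a small sketch. The class RuleDemo and the helper firstMatchIndex are illustrative names, not part of any library, and the sample strings are chosen only to exercise one rule each.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleDemo {
    // Index of the first match of regex in input, or -1 if there is none.
    static int firstMatchIndex(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Character classes
        System.out.println("b".matches("[abc]"));        // true
        System.out.println("b".matches("[^abc]"));       // false
        // Predefined classes and greedy quantifiers
        System.out.println("2015".matches("\\d{4}"));    // true
        System.out.println("12345".matches("\\d{2,4}")); // false: five digits exceed the upper bound
        System.out.println("".matches("\\w*"));          // true: * allows zero occurrences
        // An escaped dot matches only a literal dot
        System.out.println("a.c".matches("a\\.c"));      // true
        System.out.println("abc".matches("a\\.c"));      // false
        // Word boundary: the "cat" inside "concat" is skipped
        System.out.println(firstMatchIndex("\\bcat\\b", "concat cat")); // 7
    }
}
```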

The String class

Several methods of String accept a regular expression. Internally they are implemented by calling Pattern and Matcher.

// Test whether the whole string matches
public boolean matches(String regex)

// Replace
public String replaceFirst(String regex, String replacement)
public String replaceAll(String regex, String replacement)

// Split
public String[] split(String regex)
public String[] split(String regex, int limit)
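The five methods side by side, as a minimal sketch. The class name StringRegexDemo and the sample string are made up for illustration.

```java
import java.util.Arrays;

class StringRegexDemo {
    public static void main(String[] args) {
        String s = "a1b22c333";
        // matches tests the whole string, not a substring
        System.out.println(s.matches("[a-c0-9]+"));                // true
        System.out.println(s.matches("\\d+"));                     // false
        // replaceFirst replaces only the first match, replaceAll every match
        System.out.println(s.replaceFirst("\\d+", "#"));           // a#b22c333
        System.out.println(s.replaceAll("\\d+", "#"));             // a#b#c#
        // split drops trailing empty strings; a positive limit caps the number of pieces
        System.out.println(Arrays.toString(s.split("\\d+")));      // [a, b, c]
        System.out.println(Arrays.toString(s.split("\\d+", 2)));   // [a, b22c333]
    }
}
```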

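The Pattern and Matcher classes

Pattern: a compiled representation of a regular expression. It has no public constructor.

Usage: a regular expression string is first compiled into an instance of this class, and the resulting Pattern is then used to create a Matcher. All state involved in performing a match resides in the Matcher, so many matchers can share the same pattern.

public static Pattern compile(String regex) // Compiles the given regular expression into a pattern.
public static Pattern compile(String regex, int flags) // Compiles the given regular expression into a pattern with the given flags.
public Matcher matcher(CharSequence input) // Creates a matcher that will match the given input against this pattern.
public static boolean matches(String regex, CharSequence input) // Compiles the given regular expression and attempts to match the given input against it.

Matcher: it has no public constructor.

An engine that performs match operations on a character sequence by interpreting a Pattern.

A Matcher is created by calling the matcher method of a Pattern. Once created, it can perform three kinds of match operation:

matches tries to match the entire input sequence against the pattern.
lookingAt tries to match the input sequence against the pattern starting at the beginning, without requiring the whole sequence to match.
find scans the input sequence for the next subsequence that matches the pattern.

Each of these methods returns a boolean indicating success or failure. More information about a successful match can be obtained by querying the state of the matcher.

After a successful match, the start, end, and group methods give the details.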

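public boolean matches(): tries to match the entire region against the pattern.
public boolean find(): tries to find the next subsequence of the input that matches the pattern. After a successful match, start, end, and group give more information.
public String group(): returns the input subsequence matched by the previous match operation, as a string (possibly empty).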
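The difference between the three match operations is easiest to see on one pattern and two inputs. This is a sketch; MatchKindsDemo and its helper methods are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchKindsDemo {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // The whole input must consist of digits.
    static boolean wholeInput(String s) { return DIGITS.matcher(s).matches(); }

    // The input must start with digits; anything may follow.
    static boolean prefix(String s) { return DIGITS.matcher(s).lookingAt(); }

    // Digits may appear anywhere in the input.
    static boolean anywhere(String s) { return DIGITS.matcher(s).find(); }

    public static void main(String[] args) {
        System.out.println(wholeInput("123abc")); // false
        System.out.println(prefix("123abc"));     // true
        System.out.println(anywhere("123abc"));   // true

        System.out.println(prefix("abc123"));     // false
        System.out.println(anywhere("abc123"));   // true

        // After a successful find, start/end/group describe the match.
        Matcher m = DIGITS.matcher("abc123");
        if (m.find()) {
            System.out.println(m.start() + " " + m.end() + " " + m.group()); // 3 6 123
        }
    }
}
```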

PatternSyntaxException: an unchecked exception that indicates a syntax error in a regular-expression pattern.
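Because the exception is unchecked, the compiler does not force a try/catch. A minimal sketch of catching it explicitly; SyntaxCheckDemo and isValidRegex are illustrative names.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxCheckDemo {
    // Returns true if the string compiles as a Java regular expression.
    static boolean isValidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // e.getDescription() and e.getIndex() say what went wrong and where
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("a*b"));  // true
        System.out.println(isValidRegex("(a*b")); // false: the group is never closed
    }
}
```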

See the API documentation for details and further methods.

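Typical invocation sequence

Pattern p = Pattern.compile("a*b"); // First compile the regex string into a Pattern instance.
Matcher m = p.matcher("aaaaab"); // Create a Matcher from the Pattern. All match state lives in the Matcher, so many matchers can share one pattern.
boolean b = m.matches();

// The three lines above chain into Pattern.compile(regex).matcher(input).matches(),
// which is equivalent to:
boolean b = Pattern.matches("a*b", "aaaaab");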

Exercise

/*
* Retrieval
*        Using the Pattern and Matcher classes
*        The basic invocation sequence for a pattern and a matcher
*/
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // Typical invocation sequence for a pattern and a matcher
        Pattern p = Pattern.compile("a*b"); // First compile the regex string into a Pattern instance.
        Matcher m = p.matcher("aaaaab"); // Create a Matcher from the Pattern. All match state lives in the Matcher.
        boolean b = m.matches(); // Run the match
        // boolean b = Pattern.compile("a*b").matcher("aaaaab").matches(); // Chained call: Pattern.compile(regex).matcher(input).matches()
        // boolean b = Pattern.matches("a*b", "aaaaab"); // Shorthand: Pattern.matches(regex, input)
        System.out.println(b); // true

        boolean bb = p.matcher("aaaaa").matches(); // Many matchers can share one pattern.
        System.out.println(bb); // false

        // Using the String class
        String str = "aaaaab";
        String regex = "a*b";
        boolean cc = str.matches(regex); // Behaves the same as the chained and shorthand calls above
        System.out.println(cc); // true
    }
}

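Capture groups

A capture group treats several characters as a single unit. It is created by enclosing those characters in parentheses.
Capture groups are numbered by counting their opening parentheses from left to right. For example, the expression ((A)(B(C))) has four groups: ((A)(B(C))), (A), (B(C)), and (C).
The groupCount method of Matcher returns an int giving the number of capture groups in the matcher's pattern.
group(0) is a special group that stands for the entire match. It is not included in the value returned by groupCount.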
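The numbering rule can be checked on the ((A)(B(C))) example itself. This is a sketch; GroupNumberingDemo and the helper groups are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Returns group(0)..group(n) for a full match, or an empty array if the input does not match.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("((A)(B(C)))", "ABC");
        System.out.println("groupCount: " + (g.length - 1)); // groupCount: 4
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": " + g[i]);
        }
        // 0: ABC
        // 1: ABC
        // 2: A
        // 3: BC
        // 4: C
    }
}
```

Groups 0 and 1 hold the same text here only because the outermost parentheses wrap the whole expression.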

Usage:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        String line = "This order was placed for QT3000! OK?";
        Pattern pattern = Pattern.compile("(\\D*)(\\d+)(.*)");
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            System.out.println("Found value: " + matcher.group(0));
            System.out.println("Found value: " + matcher.group(1));
            System.out.println("Found value: " + matcher.group(2));
            System.out.println("Found value: " + matcher.group(3));
        } else {
            System.out.println("NO MATCH");
        }
    }
}

Output:

Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT
Found value: 3000
Found value: ! OK?

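Regex injection

Definition

An attacker may use maliciously crafted input to alter the regular expression the program intended to use, for example so that it no longer enforces what the program requires. This can affect control flow, leak information, or enable a ReDoS attack.

Avoid building regular expressions from untrusted data.

Attack vectors

Match flags: untrusted input may override the match options, and the result may then be passed to Pattern.compile().
Greediness: untrusted input may inject a regular expression that changes the original one so that it matches as many strings as possible, exposing sensitive information.
Grouping: programmers enclose part of a regular expression in parentheses to apply a common action to that part. An attacker may change this grouping by supplying untrusted input.

Input validation

Untrusted input should be sanitized before use to prevent regex injection.
When users must supply a regular expression as input, take care that the original regular expression cannot be modified without restriction.
Before user input is passed to the regex engine, filter it against a whitelist of characters (for example letters and digits).
Developers should expose only limited regex functionality to users, to reduce the room for misuse.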
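Two ways to apply this advice, as a sketch under stated assumptions. The class QuoteDemo, its method names, and the sample strings are hypothetical. Pattern.quote is a JDK method that is not mentioned in the text above; it is shown here as one possible way to treat user input as a literal, next to the whitelist check the text describes.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Unsafe: the user input is compiled as a regex, so ".*" matches any text.
    static boolean containsUnsafe(String text, String userInput) {
        return Pattern.compile(userInput).matcher(text).find();
    }

    // Safer: Pattern.quote turns the input into a literal, so metacharacters lose their meaning.
    static boolean containsLiteral(String text, String userInput) {
        return Pattern.compile(Pattern.quote(userInput)).matcher(text).find();
    }

    // Whitelist check: accept only letters and digits before using the input at all.
    static boolean isWhitelisted(String userInput) {
        return userInput.matches("[A-Za-z0-9]+");
    }

    public static void main(String[] args) {
        String text = "secret=42";
        System.out.println(containsUnsafe(text, ".*"));  // true: the injected regex matches everything
        System.out.println(containsLiteral(text, ".*")); // false: the text contains no literal ".*"
        System.out.println(isWhitelisted(".*"));         // false
        System.out.println(isWhitelisted("secret"));     // true
    }
}
```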

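ReDoS attacks

Regular expression denial of service (ReDoS) is an algorithmic-complexity attack. It causes a denial of service by supplying a regular expression, or an input, that takes a very long time to evaluate: the program spends large amounts of time and system resources on the match and slows down or becomes unresponsive.

ReDoS overview

The regex matching provided by the JDK uses an NFA engine.
An NFA engine backtracks (a character may be tried against the pattern many times), so a failed match can be very expensive.
Simple regular expressions without groups generally do not lead to ReDoS.

Dangerous patterns

Repetition applied to a group that itself contains repetition
Examples: ^(\d+)+$, ^(\d*)*$, ^(\d+)*$, ^(\d+|\s+)*$
Repetition applied to a group that contains overlapping alternatives
Examples: ^(\d|\d|\d)+$, ^(\d|\d?)+$

With the input 1111111111111111111111x1, the regular expression ^(\d+)+$ keeps failing and retrying, which pins the CPU.

Explanation: \d+ matches one or more digits, and ()+ repeats the group itself one or more times. The run of digits can therefore be divided among the group repetitions in exponentially many ways, and when the match fails at x the engine tries each of them before giving up, exhausting CPU time.

Mitigations

Check the length of the text before matching it against a regex.
Keep regular expressions as simple as possible. The more complex a regex is, the more likely it is to have a flaw.
Use as few groups as possible.
Avoid building regular expressions dynamically (their performance is hard to judge). When a regex must be built from untrusted data, validate that data strictly against a whitelist.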
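The first three mitigations can be combined in a few lines. This is a sketch, not a complete defense: the class SafeDigitsDemo and the limit MAX_LENGTH = 64 are assumptions chosen for illustration. The nested pattern ^(\d+)+$ is replaced by the group-free ^\d+$, which accepts the same strings but gives the engine only one way to divide up the digits.

```java
import java.util.regex.Pattern;

class SafeDigitsDemo {
    // Hypothetical upper bound; pick a value that fits the real input.
    private static final int MAX_LENGTH = 64;

    // Same language as ^(\d+)+$, but with no nested repetition to backtrack through.
    private static final Pattern DIGITS = Pattern.compile("^\\d+$");

    static boolean isAllDigits(String input) {
        // Check the length first so overlong input never reaches the regex engine.
        if (input == null || input.length() > MAX_LENGTH) {
            return false;
        }
        return DIGITS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("1111111111111111111111"));   // true
        System.out.println(isAllDigits("1111111111111111111111x1")); // false, and it fails quickly
    }
}
```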

Examples

Validating a QQ number

/*
* Validate a QQ number.
*         1: it must consist of 5 to 15 digits
*         2: it must not start with 0
* Analysis:
*         A: read a QQ number from the keyboard
*         B: write a method that validates it
*         C: call the method and print the result.
*/
import java.util.Scanner;

public class RegexDemo {
    public static void main(String[] args) {
        // Create the keyboard input object
        Scanner sc = new Scanner(System.in);
        System.out.println("Please enter your QQ number:");
        String qq = sc.nextLine();
        System.out.println("checkQQ:" + checkQQ(qq));
    }

    /*
     * Validation method. Two things to pin down: the return type (boolean) and the parameter list (String qq).
     */
    public static boolean checkQQ(String qq) {
        boolean flag = true;
        // Check the length
        if (qq.length() >= 5 && qq.length() <= 15) {
            // It must not start with 0
            if (!qq.startsWith("0")) {
                // Every character must be a digit
                char[] chs = qq.toCharArray();
                for (int x = 0; x < chs.length; x++) {
                    char ch = chs[x];
                    if (!Character.isDigit(ch)) {
                        flag = false;
                        break;
                    }
                }
            } else {
                flag = false;
            }
        } else {
            flag = false;
        }
        return flag;
    }
}

Improved with a regular expression:

import java.util.Scanner;

public class RegexDemo {
    public static void main(String[] args) {
        // Create the keyboard input object
        Scanner sc = new Scanner(System.in);
        System.out.println("Please enter your QQ number:");
        String qq = sc.nextLine();
        System.out.println("checkQQ:" + checkQQ(qq));
    }

    public static boolean checkQQ(String qq) {
        // String's public boolean matches(String regex) tells whether this string matches the given regular expression
        // return qq.matches("[1-9][0-9]{4,14}");
        return qq.matches("[1-9]\\d{4,14}");
    }
}

Validating phone numbers and email addresses

Splitting data by different rules

Example 1:

/*
* Splitting
*        String's public String[] split(String regex): splits this string around matches of the given regular expression.
* Example:
*         Baihe, Jiayuan, Zhenai, QQ
*         Searching for friends
*             Gender: female
*             Age range: "18-24"
*         age>=18 && age<=24
*/
import java.util.Scanner;

public class RegexDemo {
    public static void main(String[] args) {
        // Define an age range to search for
        String ages = "18-24";
        // Define the rule
        String regex = "-";
        // Call the method
        String[] strArray = ages.split(regex);
        // Iterate over the pieces
        for (int x = 0; x < strArray.length; x++) {
            System.out.println(strArray[x]);
        }
        // How do we get int values? Parse each piece.
        int startAge = Integer.parseInt(strArray[0]);
        int endAge = Integer.parseInt(strArray[1]);
        // Read an age from the keyboard
        Scanner sc = new Scanner(System.in);
        System.out.println("Please enter your age:");
        int age = sc.nextInt();
        if (age >= startAge && age <= endAge) {
            System.out.println("You are just who I am looking for");
        } else {
            System.out.println("You do not meet my requirements");
        }
    }
}

Example 2:

/*
* Splitting exercise
*/
public class RegexDemo {
    public static void main(String[] args) {
        // Define a string
        String s1 = "aa,bb,cc";
        // Split it directly
        String[] str1Array = s1.split(",");
        for (int x = 0; x < str1Array.length; x++) {
            System.out.println(str1Array[x]);
        }
        // The dot is a metacharacter, so it has to be escaped
        String s2 = "aa.bb.cc";
        String[] str2Array = s2.split("\\.");
        for (int x = 0; x < str2Array.length; x++) {
            System.out.println(str2Array[x]);
        }
        // One or more spaces
        String s3 = "aa bb cc";
        String[] str3Array = s3.split(" +");
        for (int x = 0; x < str3Array.length; x++) {
            System.out.println(str3Array[x]);
        }
        // A path on disk: in a Java string literal, each \ is written as \\
        String s4 = "E:\\JavaSE\\day14\\avi";
        String[] str4Array = s4.split("\\\\"); // The regex \\ matches one literal backslash
        for (int x = 0; x < str4Array.length; x++) {
            System.out.println(str4Array[x]);
        }
    }
}

Example 3:

/*
* Given the string "91 27 46 38 50",
* write code whose final output is "27 38 46 50 91".
* Analysis:
*         A: define the string
*         B: split the string into a string array
*         C: convert the string array into an int array
*         D: sort the int array
*         E: assemble the sorted int array back into a string
*         F: print the string
*/
import java.util.Arrays;

public class RegexDemo {
    public static void main(String[] args) {
        String s = "91 27 46 38 50";
        String[] strArray = s.split(" ");
        int[] arr = new int[strArray.length];
        for (int x = 0; x < arr.length; x++) {
            arr[x] = Integer.parseInt(strArray[x]);
        }
        Arrays.sort(arr);
        // String result1 = Arrays.toString(arr);
        // System.out.println("result1:"+result1);// result1:[27, 38, 46, 50, 91]
        StringBuilder sb = new StringBuilder();
        for (int x = 0; x < arr.length; x++) {
            sb.append(arr[x]).append(" ");
        }
        String result2 = sb.toString().trim();
        System.out.println("result2:" + result2);// result2:27 38 46 50 91
    }
}

Replacing the digits in a forum post with *

/*
* Replacing
*     String's public String replaceAll(String regex,String replacement)
*     replaces every substring of this string that matches the given regular expression with the given replacement.
*/
public class RegexDemo {
    public static void main(String[] args) {
        String str = "hello!qq:12345;world!kh:622112345678;java!";
        // Replace each run of digits with a single *
        String regex1 = "\\d+";
        String ss1 = "*";
        String result1 = str.replaceAll(regex1, ss1);
        System.out.println(result1); // hello!qq:*;world!kh:*;java!
        // Replace each individual digit with *
        String regex2 = "\\d";
        String ss2 = "*";
        String result2 = str.replaceAll(regex2, ss2);
        System.out.println(result2); // hello!qq:*****;world!kh:************;java!
        // Remove the digits altogether
        String regex3 = "\\d+";
        String ss3 = "";
        String result3 = str.replaceAll(regex3, ss3);
        System.out.println(result3); // hello!qq:;world!kh:;java!
    }
}

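Extracting the three-character words from a string

/*
* Retrieval:
* Extract the words made up of three characters from the following string
* da jia ting wo shuo,jin tian yao xia yu,bu shang wan zi xi,gao xing bu?
*/
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        String s = "da jia ting wo shuo,jin tian yao xia yu,bu shang wan zi xi,gao xing bu?";
        String regex = "\\b\\w{3}\\b"; // The regular expression
        Pattern p = Pattern.compile(regex); // Compile the regex into a pattern object
        Matcher m = p.matcher(s); // Get a matcher object from the pattern
        // find() looks for the next substring that satisfies the pattern: public boolean find()
        while (m.find()) {
            // How do we get the value? public String group()
            System.out.println(m.group());
        }
        // Note: find() must succeed before group() is called; otherwise group() throws IllegalStateException
    }
}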