Scratch

jvvg · 2012-11-13 22:06:28

I am currently writing an experimental XHTML corrector that will be used on Mod Share. It is designed to help us detect fundamental flaws in the code. I am publishing the code for anyone else to use and help develop.

Please note that this is designed for XHTML 1.0 Strict.

Mainly, it does this:
-Detects empty HTML and removes it
-Detects nesting block tags inside inline tags and throws an error in that case
-Detects self-closing tags that were not closed
-Detects deprecated tags

What I need help with:
-Making it cooler and having it check more things (like tag order)
-Improving tag lists

Here's the code:

Code:

//the variable $output contains the data to be corrected
//prints every line containing $error, highlighted, with two lines of context on either side
function printerrorregion($content, $error) {
    $lines = explode("\n", htmlspecialchars($content));
    $error = htmlspecialchars($error);
    $last = count($lines) - 1;
    echo '<pre>';
    foreach ($lines as $i => $val) {
        if ($error !== '' && strpos($val, $error) !== false) {
            //clamp the context window so it never reads outside the document
            for ($j = max(0, $i - 2); $j <= min($last, $i + 2); $j++) {
                $line = ($j == $i) ? str_replace($error, '<b style="color:#F00">' . $error . '</b>', $val) : $lines[$j];
                printf("%-4d: %s\n", $j + 1, $line); //line numbers are 1-based
            }
        }
    }
    echo '</pre>';
}
$output = preg_replace('%<(tr|td|a|b|i|u|em|strong|h[1-6]|div|p)></\1>%', '', $output); //remove empty HTML code
$block_tags = array('div', 'p', 'h[1-6]');
$inline_tags = array('a', 'b', 'i', 'u', 's');
$selfclose_tags = array('br', 'hr', 'img');
$deprecated_tags = array('font', 'center');
//check for block tags inside inline ones
//\b keeps <b from matching <br>, and [^>]* allows attributes such as <a href="...">
$pattern = '%<(' . implode('|', $inline_tags) . ')\b[^>]*>(.*?)</\1>%';
if (preg_match_all($pattern, $output, $matches)) {
    foreach ($matches[2] as $key => $val) {
        //\b keeps <p from matching <pre> or <param>
        if (preg_match('%<(' . implode('|', $block_tags) . ')\b[^>]*>%', $val, $newmatches)) {
            echo '<p>You used a block tag in an inline tag. More specifically, a &lt;' . $newmatches[1] . '&gt; tag inside a &lt;' . $matches[1][$key] . '&gt; tag</p>';
            printerrorregion($output, $val);
            die;
        }
    }
}
//check that self-closing tags are closed, either as <br /> (or <br/>) or as <br></br>
$pattern = '%<(' . implode('|', $selfclose_tags) . ')\b([^>]*)>(</\1>)?%';
if (preg_match_all($pattern, $output, $selfclosematches, PREG_SET_ORDER)) {
    foreach ($selfclosematches as $m) {
        $selfclosed = substr($m[2], -1) === '/'; //tag ends in "/>"
        $pairclosed = !empty($m[3]);             //tag is followed directly by its closing tag
        if (!$selfclosed && !$pairclosed) {
            echo '<p>You forgot to close a tag: &lt;' . htmlspecialchars($m[1] . rtrim($m[2])) . ' /&gt;.</p>';
            printerrorregion($output, $m[0]);
            die;
        }
    }
}
//check for deprecated tags
if (preg_match('%<(' . implode('|', $deprecated_tags) . ')\b[^>]*>%', $output, $matches)) {
    echo '<p>&lt;' . $matches[1] . '&gt; is a deprecated tag that has been detected in your code. Get rid of it right now.</p>';
    printerrorregion($output, $matches[0]);
    die;
}
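
For the tag-order check mentioned in the help list, one possible starting point is a small stack-based pass over the tags. This is only a sketch under assumptions: `find_nesting_error` and its return strings are hypothetical names that are not part of the script above, and it does not skip comments or CDATA sections.

```php
<?php
// Hypothetical helper, not part of the original script.
// Returns null when tags nest correctly, otherwise a short description.
function find_nesting_error($html, array $void_tags) {
    $stack = array();
    // Match opening, closing and self-closed tags.
    // Group 1 is "/" for closing tags, group 3 is "/" for <br />-style tags.
    preg_match_all('%<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>%', $html, $tags, PREG_SET_ORDER);
    foreach ($tags as $t) {
        list(, $closing, $name, $selfclosed) = $t;
        $name = strtolower($name);
        if ($selfclosed === '/' || in_array($name, $void_tags, true)) {
            continue; // nothing to track for <br /> and friends
        }
        if ($closing === '') {
            $stack[] = $name; // opening tag: remember it
        } elseif (array_pop($stack) !== $name) {
            return 'Unexpected </' . $name . '>'; // closing tag does not match the last opened tag
        }
    }
    return count($stack) ? 'Unclosed <' . end($stack) . '>' : null;
}

$void = array('br', 'hr', 'img');
var_dump(find_nesting_error('<p><b>bold</b><br /></p>', $void));
var_dump(find_nesting_error('<b><i>text</b></i>', $void));
```

A stack is used because correct nesting means the most recently opened tag must be the first one closed.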

Last edited by jvvg (2012-11-13 22:28:24)

nXIII · 2012-11-13 22:25:30

Just use HTML5 :P

Code:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Minimal XHTML 1.0 Document</title>
  </head>
  <body></body>
</html>

Code:

<!DOCTYPE html>
<title>Minimal HTML5 Document</title>

jvvg · 2012-11-13 22:29:53

nXIII wrote:

Just use HTML5 :P

Code:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Minimal XHTML 1.0 Document</title>
  </head>
  <body></body>
</html>

Code:

<!DOCTYPE html>
<title>Minimal HTML5 Document</title>

XHTML is a lot better because it requires good syntax. That's why I like it. I don't think that designers should be allowed to do stuff like not closing tags. While the document is a bit more complex, it is a LOT more specific in what it is doing and much easier for a computer to parse. It knows where the document head stuff is, exactly what type of data it is trying to parse, the XML information, etc.

nXIII · 2012-11-13 23:13:51

jvvg wrote:
XHTML is a lot better because it requires good syntax. That's why I like it. I don't think that designers should be allowed to do stuff like not closing tags. While the document is a bit more complex, it is a LOT more specific in what it is doing and much easier for a computer to parse. It knows where the document head stuff is, exactly what type of data it is trying to parse, the XML information, etc.

It doesn't matter how easy it is for a computer to parse. They're getting by quite fine with all of the web pages out there, and the parsing behavior for HTML5 is well-defined. It's not that XHTML has a "good syntax" and HTML5 has a "bad syntax;" it's just that HTML5 has a cleaner and more readable syntax. I like XML, but HTML5 is a much more appropriate tool for hand-authoring documents than XHTML.

XHTML 1.0 Strict is also missing the new tags in HTML5 like <article>, <section>, and <header> which allow authors to accurately encode the structure of a document, and the media tags <audio> and <video>.

This:

Code:

<!doctype html>
<meta charset=utf-8>
<title>Title</title>
<header>
    <nav>
        <ul>
            <li><a href=home>Home</a>
            <li><a href=forums>Forums</a>
            <li><a href=about>About</a>
        </ul>
    </nav>
</header>
<table>
    <tr><td>A
        <td>B
    <tr><td>C
        <td>D
</table>
<form method=POST action=submit>
    <label>Name: <input name=name></label>
    <p>Gender:</p>
    <label><input type=radio name=gender> Male</label>
    <label><input type=radio name=gender> Female</label>
</form>

Is a lot easier to write than this:

Code:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>Title</title>
    </head>
    <body>
        <div id="header">
            <div class="navigation">
                <ul>
                    <li><a href="home">Home</a></li>
                    <li><a href="forums">Forums</a></li>
                    <li><a href="about">About</a></li>
                </ul>
            </div>
        </div>
        <table>
            <tbody>
                <tr>
                    <td>A</td>
                    <td>B</td>
                </tr>
                <tr>
                    <td>C</td>
                    <td>D</td>
                </tr>
            </tbody>
        </table>
        <form method="post" action="submit">
            <p>
                <label for="name">Name:</label>
                <input type="text" name="name" id="name" />
            </p>
            <p>Gender:</p>
            <p>
                <input type="radio" name="gender" id="gender-male" />
                <label for="gender-male">Male</label>
            </p>
            <p>
                <input type="radio" name="gender" id="gender-female" />
                <label for="gender-female">Female</label>
            </p>
        </form>
    </body>
</html>

Last edited by nXIII (2012-11-13 23:18:58)

jvvg · 2012-11-14 12:04:09

nXIII wrote:
jvvg wrote:
XHTML is a lot better because it requires good syntax. That's why I like it. I don't think that designers should be allowed to do stuff like not closing tags. While the document is a bit more complex, it is a LOT more specific in what it is doing and much easier for a computer to parse. It knows where the document head stuff is, exactly what type of data it is trying to parse, the XML information, etc.
It doesn't matter how easy it is for a computer to parse. They're getting by quite fine with all of the web pages out there, and the parsing behavior for HTML5 is well-defined. It's not that XHTML has a "good syntax" and HTML5 has a "bad syntax;" it's just that HTML5 has a cleaner and more readable syntax. I like XML, but HTML5 is a much more appropriate tool for hand-authoring documents than XHTML.

XHTML 1.0 Strict is also missing the new tags in HTML5 like <article>, <section>, and <header> which allow authors to accurately encode the structure of a document, and the media tags <audio> and <video>.

This:
Code:
<!doctype html>
<meta charset=utf-8>
<title>Title</title>
<header>
    <nav>
        <ul>
            <li><a href=home>Home</a>
            <li><a href=forums>Forums</a>
            <li><a href=about>About</a>
        </ul>
    </nav>
</header>
<table>
    <tr><td>A
        <td>B
    <tr><td>C
        <td>D
</table>
<form method=POST action=submit>
    <label>Name: <input name=name></label>
    <p>Gender:</p>
    <label><input type=radio name=gender> Male</label>
    <label><input type=radio name=gender> Female</label>
</form>
Is a lot easier to write than this:
Code:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
    PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>Title</title>
    </head>
    <body>
        <div id="header">
            <div class="navigation">
                <ul>
                    <li><a href="home">Home</a></li>
                    <li><a href="forums">Forums</a></li>
                    <li><a href="about">About</a></li>
                </ul>
            </div>
        </div>
        <table>
            <tbody>
                <tr>
                    <td>A</td>
                    <td>B</td>
                </tr>
                <tr>
                    <td>C</td>
                    <td>D</td>
                </tr>
            </tbody>
        </table>
        <form method="post" action="submit">
            <p>
                <label for="name">Name:</label>
                <input type="text" name="name" id="name" />
            </p>
            <p>Gender:</p>
            <p>
                <input type="radio" name="gender" id="gender-male" />
                <label for="gender-male">Male</label>
            </p>
            <p>
                <input type="radio" name="gender" id="gender-female" />
                <label for="gender-female">Female</label>
            </p>
        </form>
    </body>
</html>

HTML5 actually does NOT have a cleaner syntax. Any language in which you don't have to close all of your tags, provide a complete structure for the documents, provide attributes for values, etc. is sloppy.

nXIII · 2012-11-14 17:10:15

jvvg wrote:
HTML5 actually does NOT have a cleaner syntax. Any language in which you don't have to close all of your tags, provide a complete structure for the documents, provide attributes for values, etc. is sloppy.

Firstly, "cleaner" is a subjective measure, so you really can't say with (permacapped) certainty that something is clean or not clean. I was simply trying to provide my thoughts on why HTML5 is cleaner. To address your concerns:

1) You don't "have to close all of your tags"
The only situations in which you don't have to close tags are situations in which the meaning is unambiguous: the start of one tag implicitly provides the end of another. For example, you cannot directly nest list items (i.e. <li><li>...</li></li>), so there's no reason to provide a closing tag: the next <li> element (or the end of the list) clearly indicates the end of the tag.

Code:

<ul>
    <li>1
    <li>2
    <li>3
</ul>

Is exactly the same as:

Code:

<ul>
    <li>1
    </li><li>2
    </li><li>3
</li></ul>

Which normalizes to:

Code:

<ul><li>1</li><li>2</li><li>3</li></ul>

Which is exactly the same as the one with closing tags.

Allowing end tags to be implicit is a convenience to programmers, like string literals in Java: You could explicitly instantiate strings from character arrays, but most people think it's a lot easier and neater to use the implicit mechanism in the language (viz. double-quoting).
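
One way to check the equivalence from PHP, the language of the script in this thread, is to parse both forms and compare the resulting list items. This is a sketch that assumes PHP's DOM extension is available. `list_item_texts` is a hypothetical helper, and `DOMDocument::loadHTML` uses libxml2's HTML parser, which predates the HTML5 parsing algorithm but also closes an open `<li>` when the next one starts.

```php
<?php
// Hypothetical helper: returns the text of every <li> found after parsing $html.
function list_item_texts($html) {
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $texts = array();
    foreach ($doc->getElementsByTagName('li') as $li) {
        $texts[] = $li->textContent;
    }
    return $texts;
}

$implicit = list_item_texts('<ul><li>1<li>2<li>3</ul>');
$explicit = list_item_texts('<ul><li>1</li><li>2</li><li>3</li></ul>');

// Both inputs should yield the same three separate list items.
var_dump($implicit === $explicit);
```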

2) You don't have to "provide a complete structure for documents"
Actually, you do. It's just that the minimal structure of an HTML5 document has a lot fewer characters than that of an XHTML document.

3) You don't have to "provide attributes for values"
I assume here you mean you don't have to provide values for boolean attributes. I've always considered this a weakness of X(HT)ML, rather than a strength, because the value of a boolean attribute is indicated by its presence, not its value: requiring programmers to provide a value is misleading and redundant. The HTML way (providing an attribute name with no associated value) indicates boolean attributes clearly to readers, and avoids the need to include non-semantic attribute values for the sake of fitting into XML's syntax.
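
To illustrate the point that presence rather than value carries the meaning, here is a minimal sketch using PHP's DOM extension on XHTML-style markup. `disabled_states` is a hypothetical helper, and the sample form is made up for the example.

```php
<?php
// Hypothetical helper: maps each input's name to whether it carries a disabled attribute.
function disabled_states($xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $states = array();
    foreach ($doc->getElementsByTagName('input') as $input) {
        // hasAttribute() looks only at presence, never at the attribute's value
        $states[$input->getAttribute('name')] = $input->hasAttribute('disabled');
    }
    return $states;
}

$form = '<form><input name="a" disabled="disabled" /><input name="b" disabled="" /><input name="c" /></form>';
var_dump(disabled_states($form));
```

Inputs "a" and "b" both count as disabled even though their attribute values differ; only "c", which lacks the attribute, does not.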

jvvg · 2012-11-14 17:29:31

nXIII wrote:
jvvg wrote:
HTML5 actually does NOT have a cleaner syntax. Any language in which you don't have to close all of your tags, provide a complete structure for the documents, provide attributes for values, etc. is sloppy.
Firstly, "cleaner" is a subjective measure, so you really can't say with (permacapped) certainty that something is clean or not clean. I was simply trying to provide my thoughts on why HTML5 is cleaner. To address your concerns:

1) You don't "have to close all of your tags"
The only situations in which you don't have to close tags are situations in which the meaning is unambiguous: the start of one tag implicitly provides the end of another. For example, you cannot directly nest list items (i.e. <li><li>...</li></li>), so there's no reason to provide a closing tag: the next <li> element (or the end of the list) clearly indicates the end of the tag.
Code:
<ul>
    <li>1
    <li>2
    <li>3
</ul>
Is exactly the same as:
Code:
<ul>
    <li>1
    </li><li>2
    </li><li>3
</li></ul>
Which normalizes to:
Code:
<ul><li>1</li><li>2</li><li>3</li></ul>
Which is exactly the same as the one with closing tags.

Allowing end tags to be implicit is a convenience to programmers, like string literals in Java: You could explicitly instantiate strings from character arrays, but most people think it's a lot easier and neater to use the implicit mechanism in the language (viz. double-quoting).

2) You don't have to "provide a complete structure for documents"
Actually, you do. It's just that the minimal structure of an HTML5 document has a lot fewer characters than that of an XHTML document.

3) You don't have to "provide attributes for values"
I assume here you mean you don't have to provide values for boolean attributes. I've always considered this a weakness of X(HT)ML, rather than a strength, because the value of a boolean attribute is indicated by its presence, not its value: requiring programmers to provide a value is misleading and redundant. The HTML way (providing an attribute name with no associated value) indicates boolean attributes clearly to readers, and avoids the need to include non-semantic attribute values for the sake of fitting into XML's syntax.

Clean in this case is not subjective. Just saying <br> or <p>test<p>test 2 is NOT clean. It is incredibly sloppy. It is a sign that you are lazy and don't care about making your code structured.

Even if the meaning of leaving out a tag is "unambiguous", it is still necessary to close it. Otherwise, it is still sloppy. Think about it this way: when writing a program in C, even if it's obvious that the else statement is applying to a certain IF statement, you still have to close the first one. Not requiring the tags makes the code sloppy.

2. Putting the title, content and CSS all in the same part of a document really is not a good way to be organized. It clutters up the document and makes it harder to understand where everything goes.

3. Just saying '<input type="text" disabled>' is very weak syntax. What does "disabled" mean? Even better, when saying element.disabled = '' in JavaScript, that makes the element NOT disabled, even though you are technically giving it the value it had when it WAS disabled. You always need to be clear about what you are saying.

Also, I think that this thread is about my RegEx parser, not about arguing about whether XHTML or HTML5 is better.

MathWizz · 2012-11-14 17:47:41

For #3, when an attribute is present without a value in HTML5, it is true. It does not get assigned an empty string. I see nothing wrong with this.

And... Empty HTML is useful.

jvvg · 2012-11-14 17:50:14

MathWizz wrote:
For #3, when an attribute is present without a value in HTML5, it is true. It does not get assigned an empty string. I see nothing wrong with this.

And... Empty HTML is useful.

Empty HTML can be useful, but in the case of Mod Share, it is not. One of the scripts generates <tr></tr>, and we're using this until one of us can write something better.

nXIII · 2012-11-14 17:51:04

jvvg wrote:
Clean in this case is not subjective. Just saying <br> or <p>test<p>test 2 is NOT clean. It is incredibly sloppy. It is a sign that you are lazy and don't care about making your code structured.

Of course it's lazy! It's not sloppy, though: it's precise and it's well-structured. It just has fewer characters.

jvvg wrote:
Even if the meaning of leaving out a tag is "unambiguous", it is still necessary to close it.

Nope.

jvvg wrote:
Think about it this way: when writing a program in C, even if it's obvious that the else statement is applying to a certain IF statement, you still have to close the first one.

You're talking about it being obvious to the programmer, but not obvious to the computer. That's not the case in HTML5, where it is obvious to both the programmer and the computer.

jvvg wrote:
2. Putting the title, content and CSS all in the same part of a document really is not a good way to be organized. It clutters up the document and makes it harder to understand where everything goes.

You mean this:

Code:

<!doctype html>
<meta charset=utf-8>
<title>My Page</title>

<nav>…</nav>
<article>…</article>
<aside>…</aside>

Is harder to understand than this?

Code:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html 
     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
        <title>My Page</title>
    </head>
    <body>
        <div class="navigation">
            …
        </div>
        <div class="article">
            …
        </div>
        <div class="aside">
            …
        </div>
    </body>
</html>

jvvg wrote:
3. Just saying '<input type="text" disabled>' is very weak syntax.

I'm not sure what you mean by "weak", but it's very explicitly defined.

jvvg wrote:
What does "disabled" mean?

This.

jvvg wrote:
Even better, when saying element.disabled = '' in JavaScript, that makes the element NOT disabled, even though you are technically giving it the value it had when it WAS disabled.

1) This is true for XHTML as well as HTML5, so I'm not sure what your point is here
2) This is well-defined behavior.

jvvg wrote:
Also, I think that this thread is about my RegEx parser, not about arguing about whether XHTML or HTML5 is better.

You should use an XML parser

jvvg wrote:
Empty HTML can be useful, but in the case of Mod Share, it is not. One of the scripts generates <tr></tr>, and we're using this until one of us can write something better.

…So you should check for <tr></tr>, not any empty HTML tag (for example, external <script> tags are empty and are an essential part of most HTML pages).
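
A narrower cleanup along those lines might look like the sketch below. `remove_empty_rows` is a hypothetical name, the sample markup is made up, and allowing whitespace between the two tags is an assumption that goes slightly beyond the exact string `<tr></tr>`.

```php
<?php
// Hypothetical helper: strips only empty table rows and leaves every other empty element alone.
function remove_empty_rows($html) {
    return preg_replace('%<tr>\s*</tr>%', '', $html);
}

$sample = '<table><tr></tr><tr><td>A</td></tr></table>'
        . '<script type="text/javascript" src="app.js"></script>';

// The empty row disappears; the empty external <script> element is kept.
echo remove_empty_rows($sample), "\n";
```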

Last edited by nXIII (2012-11-14 17:56:27)

jvvg · 2012-11-14 17:55:39

nXIII wrote:
...lots of stuff...
Also, I think that this thread is about my RegEx parser, not about arguing about whether XHTML or HTML5 is better.

Please read that part again. This isn't an HTML5 vs. XHTML thread.

My final argument before I am going to start reporting off-topic posts: just because you can be sloppy doesn't mean you should.
For example, when speaking, people usually try to use good grammar. Even if using bad grammar, people can usually tell what you mean. People will still get what you mean if you say "Me and my brother walked to the store", but people still prefer to say "My brother and I walked to the store" because it is correct grammar and more logical.

nXIII · 2012-11-14 17:57:50

jvvg wrote:
nXIII wrote:
...lots of stuff...
Also, I think that this thread is about my RegEx parser, not about arguing about whether XHTML or HTML5 is better.
Please read that part again. This isn't an HTML5 vs. XHTML thread.

Please read my post again. The last two sections are about your parser. Also:

jvvg wrote:
just because you can be sloppy doesn't mean you should.
For example, when speaking, people usually try to use good grammar. Even if using bad grammar, people can usually tell what you mean. People will still get what you mean if you say "Me and my brother walked to the store", but people still prefer to say "My brother and I walked to the store" because it is correct grammar and more logical.

grammar = syntax, and what you're describing is not sloppy: it's part of the syntax.

Last edited by nXIII (2012-11-14 18:00:20)

jvvg · 2012-11-14 18:00:03

nXIII wrote:
jvvg wrote:
nXIII wrote:
...lots of stuff...

Please read that part again. This isn't an HTML5 vs. XHTML thread.
Please read my post again. The last two sections are about your parser.

In that case, the other sections weren't really necessary.

About using an XML parser: I tried, but it doesn't recognize all of the character entities I use (e.g. &nbsp;) even though I verified that they are correct.
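
A possible workaround, sketched here under assumptions: XML itself predefines only five entities (amp, lt, gt, quot, apos), so a parser that does not load the XHTML DTD will reject names such as &nbsp;. The thread does not say which parser was tried. This example assumes PHP's DOMDocument, and `replace_named_entities` is a hypothetical helper that swaps HTML named entities for their literal UTF-8 characters before parsing.

```php
<?php
// Hypothetical helper: replaces HTML named entities with literal UTF-8 characters,
// leaving the five entities that XML already understands untouched.
function replace_named_entities($xml) {
    $xml_builtin = array('amp', 'lt', 'gt', 'quot', 'apos');
    return preg_replace_callback('/&([A-Za-z][A-Za-z0-9]*);/', function ($m) use ($xml_builtin) {
        if (in_array($m[1], $xml_builtin, true)) {
            return $m[0]; // decoding these would break the markup
        }
        // Unknown names come back unchanged, so the parser still reports them.
        return html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
    }, $xml);
}

$source = '<p>Fish&nbsp;&amp;&nbsp;chips &copy; 2012</p>';
$doc = new DOMDocument();
$loaded = $doc->loadXML(replace_named_entities($source));
var_dump($loaded);
```

Literal characters are used instead of numeric references to keep the sketch short; this relies on the document being UTF-8.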

You will also notice that the script that removes empty stuff only runs on some tags.

nXIII · 2012-11-14 18:02:48

jvvg wrote:
You will also notice that the script that removes empty stuff only runs on some tags

empty <a> tags are useful, as are empty <div>s…

jvvg · 2012-11-14 18:18:03

nXIII wrote:
jvvg wrote:
You will also notice that the script that removes empty stuff only runs on some tags
empty <a> tags are useful, as are empty <div>s…

We haven't used them in that way yet, so I'll remove them if necessary. The only one that really needs to be in there for now is the <tr> tag.

Experimental XHTML corrector